Azure Data Explorer Kusto: group by with difference in datetime - azure-data-explorer

Is there a way to compute datetime differences grouped by a column in Azure Data Explorer (Kusto)? I would like to find the total time each traveler spent in Spain.
A traveler is considered to be in a country from the time they arrive there until the time they arrive at their next destination. Here are the edge cases:
If TripComplete == 'No' and that country is the last visited destination, the trip is ongoing and the time spent should be the difference between now() and the entry time.
If TripComplete == 'Yes' and that country is the last visited destination, that entry is considered the end of the trip and contributes no time.
So for Spain, the result should be the time the traveler spent in Spain at earlier stops plus, if the trip is ongoing and the last stop is in Spain, the time since their last entry (now() - EntryTime).
Here is the expected result.
TravellerId  result      explanation
1            05:00:00    [Madrid to Barcelona + Barcelona to London]
2            00:00:00    [Trip complete]
3            1.00:00:00  [now() - Malaga EntryTime]
4            05:00:00    [now() - Malaga EntryTime]
5            02:00:00    [Malaga to London]
6            1.16:00:00  [Madrid to Barcelona + (now() - Barcelona EntryTime)]
7            11:00:00    [Madrid to London + Barcelona to Beiging]
Source:
set query_now = datetime(2020-02-04 5:00:00);
datatable(TravellerId:int, Country:string, City:string, TripComplete:string, EntryTime: datetime)
[
1, 'China', 'Beiging', 'Yes', datetime(2020-02-02 12:00:00),
1, 'Spain', 'Madrid', 'Yes',datetime(2020-02-02 13:00:00),
1, 'Spain', 'Barcelona', 'Yes',datetime(2020-02-02 15:00:00),
1, 'UK', 'London', 'Yes', datetime(2020-02-02 18:00:00),
2, 'Spain', 'Malaga', 'Yes', datetime(2020-02-03 5:00:00),
3, 'Spain', 'Malaga', 'No', datetime(2020-02-03 5:00:00),
4, 'China', 'Beiging', 'No', datetime(2020-02-03 5:00:00),
4, 'Spain', 'Malaga', 'No', datetime(2020-02-04 00:00:00),
5, 'China', 'Beiging', 'No', datetime(2020-02-01 5:00:00),
5, 'Spain', 'Malaga', 'No', datetime(2020-02-02 5:00:00),
5, 'UK', 'London', 'No', datetime(2020-02-02 7:00:00),
6, 'China', 'Beiging', 'No', datetime(2020-02-02 12:00:00),
6, 'Spain', 'Madrid', 'No',datetime(2020-02-02 13:00:00),
6, 'Spain', 'Barcelona', 'No',datetime(2020-02-02 14:00:00),
7, 'Spain', 'Madrid', 'Yes',datetime(2020-02-02 13:00:00),
7, 'UK', 'London', 'Yes', datetime(2020-02-02 18:00:00),
7, 'Spain', 'Barcelona', 'Yes',datetime(2020-02-03 15:00:00),
7, 'China', 'Beiging', 'Yes', datetime(2020-02-03 21:00:00),
]
| order by TravellerId asc, TripComplete asc
// Incorrect because the next() calculation should be limited to rows of the same traveler.
// Should be something like: if TripComplete == 'No' then nextEntry = next(EntryTime, 1, now()), else nextEntry = next(EntryTime, 1, EntryTime)
| extend nextEntry = next(EntryTime, 1, now())
| extend diffNext = nextEntry - EntryTime
| where Country == "Spain"
| summarize TimeSpentInSpain = sum(diffNext) by TravellerId

You could try something along the following lines:
set query_now = datetime(2020-02-04 5:00:00);
datatable(TravellerId:int, Country:string, City:string, TripComplete:string, EntryTime: datetime)
[
1, 'China', 'Beiging', 'Yes', datetime(2020-02-02 12:00:00),
1, 'Spain', 'Madrid', 'Yes',datetime(2020-02-02 13:00:00),
1, 'Spain', 'Barcelona', 'Yes',datetime(2020-02-02 14:00:00),
1, 'UK', 'London', 'Yes', datetime(2020-02-02 15:00:00),
2, 'Spain', 'Malaga', 'No', datetime(2020-02-03 5:00:00),
]
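| order by TravellerId asc, EntryTime asc
// A stay ends at the same traveler's next entry. On a traveler's last row it ends at now()
// if the trip is ongoing (TripComplete == "No"); otherwise the stay has zero length.
| extend stayEnd = iff(next(TravellerId) == TravellerId, next(EntryTime), iff(TripComplete == "No", now(), EntryTime))
| where Country == "Spain"
| summarize result = sum(stayEnd - EntryTime) by TravellerId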

Related

Dplyr: Create two columns based on specific conditions

In this dataset DF, we have 4 names and 4 professions.
DF<-tribble(
~names, ~princess, ~singer, ~astronaut, ~painter,
"diana", 4, 1, 2, 3,
"shakira", 2, 1, 3, 4,
"armstrong", 3, 4, 1, 2,
"picasso", 1, 3, 1, 4
)
Assume that the cell values are some measure of how well each person matches each profession. So, for instance, Diana has her highest cell value for princess (correctly), but Shakira has her highest cell value for painter (incorrectly).
I want to create two columns called "Compatible" and "Incompatible". For Diana, the program should pick the value 4, since it sits under her correct profession (princess), assign it to "Compatible", and put the average of the other 3 values in "Incompatible". For Shakira, it should pick the value 1 from her correct profession (singer) and assign it to "Compatible"; "Incompatible" should be the average of the other values. The same applies to the other names.
So the output will look like this:
DF1<-tribble(
~names, ~princess, ~singer, ~astronaut, ~painter,~Compatible,~Incompatible,
"diana", 4, 1, 2, 3, 4, 2,
"shakira", 2, 1, 3, 4, 1, 3,
"armstrong", 3, 4, 1, 2, 1, 3,
"picasso", 1, 3, 1, 4, 4, 1.66
)
Here is the dataset which shows the correct names and professions:
DF3<- tribble(
~names, ~professions,
"diana", "princess",
"shakira", "singer",
"armstrong", "astronaut",
"picasso", "painter"
)
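library(dplyr)
library(tidyr)
# DF1[1:5] drops the expected Compatible/Incompatible columns, leaving the original scores
DF1[1:5] %>%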
pivot_longer(-names) %>%
left_join(DF3, 'names') %>%
group_by(names, name = if_else(name == professions, 'compatible', 'incompatible')) %>%
summarise(profession = first(professions), value = mean(value), .groups = 'drop') %>%
pivot_wider()
# A tibble: 4 x 4
names profession compatible incompatible
<chr> <chr> <dbl> <dbl>
1 armstrong astronaut 1 3
2 diana princess 4 2
3 picasso painter 4 1.67
4 shakira singer 1 3
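As a cross-check on the pipeline above, here is a minimal base-R sketch of the same calculation. It assumes one row per name and one column per profession, as in the question; the helper names `correct`, `scores` and `idx` are only illustrative.

```r
# Scores from the question, as a plain data frame (base R only).
DF <- data.frame(
  names     = c("diana", "shakira", "armstrong", "picasso"),
  princess  = c(4, 2, 3, 1),
  singer    = c(1, 1, 4, 3),
  astronaut = c(2, 3, 1, 1),
  painter   = c(3, 4, 2, 4)
)
# Lookup (illustrative name): the correct profession for each person.
correct <- c(diana = "princess", shakira = "singer",
             armstrong = "astronaut", picasso = "painter")

scores <- as.matrix(DF[, -1])
# Column index of each person's correct profession.
idx <- match(correct[as.character(DF$names)], colnames(scores))
# Indexing a matrix with (row, column) pairs picks one cell per row.
DF$Compatible <- scores[cbind(seq_len(nrow(DF)), idx)]
# Mean of the remaining cells: (row total - compatible) / (professions - 1).
DF$Incompatible <- (rowSums(scores) - DF$Compatible) / (ncol(scores) - 1)
DF
```

This avoids reshaping altogether, which may be convenient when the table is small, but it relies on the profession names in the lookup matching the column names exactly.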

joining two dataframes on matching values of two common columns R

I have two dataframes, A and B, that both have multiple columns. They share the common columns "week" and "store". I would like to join these two dataframes on the matching values of the common columns.
For example this is a small subset of the data that I have:
A = data.frame(retailer = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store = c(5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6),
week = c(2021100301, 2021092601, 2021091901, 2021091201, 2021082901, 2021082201, 2021081501, 2021080801,
2021080101, 2021072501, 2021071801, 2021071101, 2021070401, 2021062701, 2021062001, 2021061301),
dollars = c(121817.9, 367566.7, 507674.5, 421257.8, 453330.3, 607551.4, 462674.8,
464329.1, 339342.3, 549271.5, 496720.1, 554858.7, 382675.5,
373210.9, 422534.2, 381668.6))
and
B = data.frame(
week = c("2020080901", "2017111101", "2017061801", "2020090701", "2020090701", "2020090701",
"2020091201","2020082301", "2019122201", "2017102901"),
store = c(14071, 11468, 2428, 17777, 14821, 10935, 5127, 14772, 14772, 14772),
fill = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)
I would like to join these two tables on the matching week AND store values in order to incorporate the "fill" column from B into A. Where the values don't match, I would like to have a label "0" in the fill column, instead of a 1. Is there a way I can do this? I am not sure which join to use as well, or if "merge" would be better for this? Essentially I am NOT trying to get rid of any rows that do not have the matching values for the two common columns. Thanks for any help!
We may do a left_join
library(dplyr)
library(tidyr)
A %>%
mutate(week = as.character(week)) %>%
left_join(B, by = c("week", "store")) %>%
mutate(fill = replace_na(fill, 0))
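Since the question also asks whether merge would work, here is a small sketch of the same left join in base R. The data is a toy cut-down of A and B; the first row of B is invented so that one (week, store) pair actually matches, which the sample in the question does not contain.

```r
# Toy data in the same shape as the question.
A <- data.frame(
  store   = c(5, 5, 6),
  week    = c(2021100301, 2021092601, 2021080101),
  dollars = c(121817.9, 367566.7, 339342.3)
)
B <- data.frame(
  week  = c("2021092601", "2020080901"),  # first row is invented to force a match
  store = c(5, 14071),
  fill  = c(1, 1)
)

# The key columns need the same type on both sides: week is numeric in A
# but character in B, so convert before joining.
A$week <- as.character(A$week)

# all.x = TRUE keeps every row of A (a left join); unmatched rows get NA in fill.
out <- merge(A, B, by = c("week", "store"), all.x = TRUE)
# Label the unmatched rows with 0 instead of NA.
out$fill[is.na(out$fill)] <- 0
out
```

Note that merge sorts the result by the key columns by default, so the row order can differ from A. Either way, if B contained the same (week, store) pair more than once, the matching rows of A would be duplicated.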

How to pivot wider in R on one column value

Below is the sample data and the manipulations I have done so far. I have tried this in other ways but have an idea that may make it a bit simpler. The intended result is shown below the code. What I am looking for is a way to pivot wider based on the rows where the smb column says Total. There are five possible values for smb: 1, 2, 3, 4, and Total. I want a new column, smb.Total, that holds the total for each smb/year/qtr/area combination. I have tried putting a filter in front of the pivot_wider statement (at the bottom).
library(readxl)
library(dplyr)
library(stringr)
library(tidyverse)
library(gt)
employment <- c(1,45,125,130,165,260,600,601,2,46,127,132,167,265,601,602,50,61,110,121,170,305,55,603,52,66,112,123,172,310,604,605)
small <- c(1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA)
area <-c(001,001,001,001,001,001,001,001,001,001,001,001,001,001,001,001,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003)
year<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020)
qtr <-c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
smbtest <- data.frame(employment,small,area,year,qtr)
smbtest$smb <-0
smbtest <- smbtest %>% mutate(smb = case_when(employment >=0 & employment <100 ~ "1",employment >=0
& employment <150 ~ "2",employment >=0 & employment <250 ~ "3", employment >=0 & employment <500 ~
"4", employment >=0 & employment <100000 ~ "Total"))
smbsummary2<-smbtest %>%
mutate(period = paste0(year,"q",qtr)) %>%
group_by(area,period,smb) %>%
summarise(employment = sum(employment), worksites = n(),
.groups = 'drop_last') %>%
mutate(employment = cumsum(employment),
worksites = cumsum(worksites))
smbsummary2<- smbsummary2%>%
group_by(area,smb)%>%
mutate(empprevyear=lag(employment),
empprevyearpp=employment-empprevyear,
empprevyearpct=((employment/empprevyear)-1),
empprevyearpct=scales::percent(empprevyearpct,accuracy = 0.01)
)
area period smb employment worksites smb.Total
1 2020q1 1 46 2 1927
1 2020q1 2 301 4 1927
1 2020q1 3 466 5 1927
1 2020q1 4 726 6 1927
1 2020q1 Total 1927 8 1927
smbsummary2<-smbsummary2 %>%
filter(small=='Total')
pivot_wider(names_from = small, values_from = employment)
Maybe this code will solve your question:
employment <- c(1, 45, 125, 130, 165, 260, 600, 601, 2, 46, 127,
132, 167, 265, 601, 602, 50, 61, 110, 121, 170,
305, 55, 603, 52, 66, 112, 123, 172, 310, 604, 605)
small <- c(1, 1, 2, 2, 3, 4, NA, NA, 1, 1, 2, 2, 3, 4, NA, NA, 1, 1,
2, 2, 3, 4, NA, NA, 1, 1, 2, 2, 3, 4, NA, NA)
area <-c(001, 001, 001, 001, 001, 001, 001, 001, 001, 001, 001, 001,
001, 001, 001, 001, 003, 003, 003, 003, 003, 003, 003, 003,
003, 003, 003, 003, 003, 003, 003, 003)
year<-c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020)
qtr <-c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
smbtest <- tibble(employment, small, area, year, qtr)
smbtest$smb <- 0
smbtest <- smbtest %>%
mutate(smb = case_when(employment >=0 & employment <100 ~ "1",
employment >=0 & employment <150 ~ "2",
employment >=0 & employment <250 ~ "3",
employment >=0 & employment <500 ~ "4",
employment >=0 & employment <100000 ~ "Total"))
smbtest <- smbtest %>%
relocate(smb, year, qtr, area, small, employment)
smbsummary2 <- smbtest %>%
mutate(period = paste0(year,"q",qtr)) %>%
group_by(area, period, smb) %>%
summarise(employment = sum(employment),
worksites = n()) %>%
mutate(employment = cumsum(employment),
worksites = cumsum(worksites))
smbsummary2 %>%
group_by(area, period) %>%
mutate(`employ/period (%)` = employment/employment[smb == "Total"]*100)
This is probably not the best answer, but I think it works well for your data.
If not, please tell me.
Good job!
I don't know if I understand correctly.
What do you want smb.total to be a total of? The employment variable?
If so, use this code for your "smbsummary2" object:
smbsummary2 <- smbtest %>%
relocate(smb, year, qtr, area, small, employment) %>%
group_by(smb, year, qtr, area) %>%
mutate(smb.total = n())
If that is not it, could you explain it in more detail?
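If the goal is the smb.Total column from the intended result, one possible reading is "copy the employment value of the Total row onto every row of the same area and period". The base-R sketch below shows that under this assumption. The input reproduces the cumulative summary for area 1 (the 2020q2 figures are worked out from the question's employment vector), and the object name `totals` is only illustrative.

```r
# Cumulative summary for area 1, as produced by the question's pipeline.
smbsummary <- data.frame(
  area       = 1,
  period     = rep(c("2020q1", "2020q2"), each = 5),
  smb        = rep(c("1", "2", "3", "4", "Total"), times = 2),
  employment = c(46, 301, 466, 726, 1927,
                 48, 307, 474, 739, 1942),
  worksites  = rep(c(2, 4, 5, 6, 8), times = 2)
)

# Keep only the Total rows and rename their employment column.
totals <- smbsummary[smbsummary$smb == "Total", c("area", "period", "employment")]
names(totals)[3] <- "smb.Total"

# Joining back on area and period repeats the total on every row of the group.
out <- merge(smbsummary, totals, by = c("area", "period"))
out
```

A join like this sidesteps pivot_wider entirely; filtering to the Total rows before pivoting would instead drop the other smb rows from the result.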

R: how to get last sold out date for each buying?

Let's say I'm a fruit vendor and I have two tables, one for purchases and one for sales, like below:
library(tibble)
library(tidyverse)
FruitBought <- tribble(
~name, ~Date, ~Qty,
"Apple", 20180101, 15,
"Apple", 20180105, 20,
"Banana", 20180102, 18,
"Banana", 20180109, 14
)
fruitSold <- tribble(
~Date, ~name, ~sold,
20180101, 'Apple', 5,
20180102, 'Apple', 3,
20180102, 'Banana', 3,
20180103, 'Apple', 1,
20180103, 'Banana', 4,
20180104, 'Apple', 2,
20180104, 'Banana', 2,
20180105, 'Apple', 1,
20180105, 'Banana', 2,
20180106, 'Apple', 2,
20180106, 'Banana', 3,
20180107, 'Apple', 2,
20180107, 'Banana', 1,
20180108, 'Apple', 0,
20180108, 'Banana', 3,
20180109, 'Apple', 2,
20180109, 'Banana', 1,
20180110, 'Apple', 3,
20180110, 'Banana', 1
)
I want to get the last sold-out date for each purchase, like this:
name | Date | Qty | LastSoldOut
"Apple" | 20180101 | 15 | 20180107
"Apple" | 20180105 | 20 | NA
"Banana" | 20180102 | 18 | 20180109
"Banana" | 20180109 | 14 | NA
Can anyone help?
1) Here is a possible approach using data.table non-equi join:
#calculate the available stock at each date
FruitBought[, CumAvail:=cumsum(Qty), by=.(name)]
#calculate the fruits sold up to date
cumsold <- fruitSold[, .(Date, SoldToDate=cumsum(sold)), .(name)]
#use non-equi join to find the first date where
#sold to date is greater than available stock as per OP
cumsold[
FruitBought, on=.(name=name, Date>=Date, SoldToDate>CumAvail),
#for each row in FruitBought, find that first date
.(Name=i.name, Date=i.Date, Qty, LastSoldOut=x.Date[1L]), by=.EACHI][,
-(1L:3L)] #remove the joining columns
2) You can also use the data.table's roll argument by slightly tweaking the number sold to date:
FruitBought[, CumAvail:=cumsum(Qty), by=.(name)]
cumsoldTweak <- fruitSold[, .(Date, SoldToDate=cumsum(sold)-1e-2), .(name)]
cumsoldTweak[FruitBought, on=c("name", SoldToDate="CumAvail"), roll=-Inf,
.(name, Date=i.Date, Qty, LastSoldOut=Date)]
output:
Name Date Qty LastSoldOut
1: Apple 20180101 15 20180107
2: Apple 20180105 20 NA
3: Banana 20180102 18 20180109
4: Banana 20180109 14 NA
data:
library(data.table)
#data.table 1.11.4 Latest news: http://r-datatable.com
FruitBought <- fread("name,Date,Qty
Apple,20180101,15
Apple,20180105,20
Banana,20180102,18
Banana,20180109,14")
#order is important before doing cumsum
setorder(FruitBought, name, Date)
fruitSold <- fread("Date,name,sold
20180101,Apple,5
20180102,Apple,3
20180102,Banana,3
20180103,Apple,1
20180103,Banana,4
20180104,Apple,2
20180104,Banana,2
20180105,Apple,1
20180105,Banana,2
20180106,Apple,2
20180106,Banana,3
20180107,Apple,2
20180107,Banana,1
20180108,Apple,0
20180108,Banana,3
20180109,Apple,2
20180109,Banana,1
20180110,Apple,3
20180110,Banana,1")
#order is important before doing cumsum
setorder(fruitSold, Date, name)
You can do something like this to get the desired output
library(tibble)
library(tidyverse)
FruitBought <- tribble(
~name, ~Date, ~Qty,~id,
"Apple", 20180101, 15,1,
"Apple", 20180105, 20,2,
"Banana", 20180102, 18,1,
"Banana", 20180109, 14,2,
)
fruitSold <- tribble(
~Date, ~name, ~sold,~id,
20180101, 'Apple', 5,1,
20180102, 'Apple', 3,1,
20180102, 'Banana', 3,1,
20180103, 'Apple', 1,1,
20180103, 'Banana', 4,1,
20180104, 'Apple', 2,1,
20180104, 'Banana', 2,1,
20180105, 'Apple', 1,1,
20180105, 'Banana', 2,1,
20180106, 'Apple', 2,1,
20180106, 'Banana', 3,1,
20180107, 'Apple', 2,1,
20180107, 'Banana', 1,1,
20180108, 'Apple', 0,2,
20180108, 'Banana', 3,1,
20180109, 'Apple', 2,2,
20180109, 'Banana', 1,1,
20180110, 'Apple', 3,2,
20180110, 'Banana', 1,1
)
fruitSold$Date <- lubridate::ymd(fruitSold$Date)
FruitBought$Date <- lubridate::ymd(FruitBought$Date )
colnames(fruitSold)
sold <- as.data.frame(fruitSold %>% group_by(name,id) %>% summarise(last_date = max(Date)))
colnames(sold)
colnames(FruitBought)
final <- left_join(FruitBought,sold, by = c("name","id"))
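For comparison, the rule used in the data.table answer (the first sale date on which the cumulative quantity sold exceeds the cumulative quantity bought) can also be sketched in base R without assigning ids by hand. This is only a sketch under that rule; the column names CumAvail and SoldToDate follow the data.table answer.

```r
FruitBought <- data.frame(
  name = c("Apple", "Apple", "Banana", "Banana"),
  Date = c(20180101, 20180105, 20180102, 20180109),
  Qty  = c(15, 20, 18, 14)
)
# Same sales as in the question: 10 Apple rows, then 9 Banana rows.
fruitSold <- data.frame(
  Date = c(20180101:20180110, 20180102:20180110),
  name = rep(c("Apple", "Banana"), times = c(10, 9)),
  sold = c(5, 3, 1, 2, 1, 2, 2, 0, 2, 3,
           3, 4, 2, 2, 3, 1, 3, 1, 1)
)

# Sort so the running totals are taken in date order within each fruit.
FruitBought <- FruitBought[order(FruitBought$name, FruitBought$Date), ]
fruitSold   <- fruitSold[order(fruitSold$name, fruitSold$Date), ]
FruitBought$CumAvail <- ave(FruitBought$Qty, FruitBought$name, FUN = cumsum)
fruitSold$SoldToDate <- ave(fruitSold$sold, fruitSold$name, FUN = cumsum)

# For each purchase, find the first sale date (on or after the purchase)
# where the cumulative quantity sold exceeds the cumulative stock.
FruitBought$LastSoldOut <- vapply(seq_len(nrow(FruitBought)), function(i) {
  hit <- fruitSold$Date[fruitSold$name == FruitBought$name[i] &
                        fruitSold$Date >= FruitBought$Date[i] &
                        fruitSold$SoldToDate > FruitBought$CumAvail[i]]
  as.numeric(hit[1])  # hit[1] is NA when the batch has not sold out yet
}, numeric(1))
FruitBought
```

On the question's data this gives 20180107 and 20180109 for the first Apple and Banana purchases and NA for the later ones. The row-by-row loop is fine for small tables; for large ones the non-equi join is likely to scale better.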

SumIfs in R - creating a subset off of multiple criteria and summing a specific column

I have a set of panel data similar to:
city <- c("ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR")
week <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5)
df <- as.data.frame(cbind(city, week))
df$week <- as.numeric(df$week)
df$x <- c(6, 3, 9, 12, 4, 3, 7, 8, 2, 12, 15, 6, 3, 9, 0, 14, 18, 2, 21, 15, 17, 9, 10, 1, 22)
I would like to create a new variable, df$y, that sums df$x for each city, and for each week, prior to the week currently being observed. So, for example, df$y[25] should equal 31 because sum(df[df$city == "CAR" & df$week < 5, 3]) equals 31.
My question is, how can I write this in a function to do this automatically?
To use sum(df[df$city == "CAR" & df$week < 5, 3]) for each team and week combination would be tedious. My natural inclination is to write something like df$y <- sum(df[df$city == df$city & df$week < df$week, 3]), but that doesn't make sense. I'm new to R and don't fully understand functions; but, is that the best route for what I'm trying to do?
Thanks for your help!
One option with dplyr
library(dplyr)
res <- df %>%
group_by(city) %>%
mutate(y = cumsum(lag(x, default = 0)))
res[25,]
# A tibble: 1 x 4
# Groups: city [1]
# city week x y
# <fctr> <dbl> <dbl> <dbl>
#1 CAR 5 22 31
One option with data.table
library(data.table)
setDT(df)[, y := c(0, cumsum(x[-length(x)])), by = 'city']
df
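The same "sum of all earlier weeks" can be written in base R with ave(), shown here as a minimal sketch. Like both options above, it assumes the rows are already sorted by week within each city; the data is rebuilt with data.frame() so that week stays numeric.

```r
city <- rep(c("ARI", "ATL", "BAL", "BUF", "CAR"), times = 5)
week <- rep(1:5, each = 5)
x <- c(6, 3, 9, 12, 4, 3, 7, 8, 2, 12, 15, 6, 3, 9, 0,
       14, 18, 2, 21, 15, 17, 9, 10, 1, 22)
df <- data.frame(city, week, x)

# Within each city, the running total minus the current value is the
# sum of x over all earlier rows (i.e. all prior weeks).
df$y <- ave(df$x, df$city, FUN = function(v) cumsum(v) - v)
df[25, ]
```

For CAR in week 5 this yields 31, matching sum(df[df$city == "CAR" & df$week < 5, 3]) from the question.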
