Let's say I'm a fruit vendor and I have two tables, one for purchases and one for sales, as below:
FruitBought <- tribble(
~name, ~Date, ~Qty,
"Apple", 20180101, 15,
"Apple", 20180105, 20,
"Banana", 20180102, 18,
"Banana", 20180109, 14
)
fruitSold <- tribble(
~Date, ~name, ~sold,
20180101, 'Apple', 5,
20180102, 'Apple', 3,
20180102, 'Banana', 3,
20180103, 'Apple', 1,
20180103, 'Banana', 4,
20180104, 'Apple', 2,
20180104, 'Banana', 2,
20180105, 'Apple', 1,
20180105, 'Banana', 2,
20180106, 'Apple', 2,
20180106, 'Banana', 3,
20180107, 'Apple', 2,
20180107, 'Banana', 1,
20180108, 'Apple', 0,
20180108, 'Banana', 3,
20180109, 'Apple', 2,
20180109, 'Banana', 1,
20180110, 'Apple', 3,
20180110, 'Banana', 1
)
I want to get the last sold-out date for each purchase, like this:
name | Date | Qty | LastSoldOut
"Apple" | 20180101 | 15 | 20180107
"Apple" | 20180105 | 20 | NA
"Banana" | 20180102 | 18 | 20180109
"Banana" | 20180109 | 14 | NA
Can anyone help?
1) Here is a possible approach using data.table non-equi join:
#calculate the available stock at each date
FruitBought[, CumAvail:=cumsum(Qty), by=.(name)]
#calculate the fruits sold up to date
cumsold <- fruitSold[, .(Date, SoldToDate=cumsum(sold)), .(name)]
#use non-equi join to find the first date where
#sold to date is greater than available stock as per OP
cumsold[FruitBought, on=.(name=name, Date>=Date, SoldToDate>CumAvail),
#for each row in FruitBought, find that first date
.(Name=i.name, Date=i.Date, Qty, LastSoldOut=x.Date[1L]), by=.EACHI][,
-(1L:3L)] #remove the joining columns
2) You can also use data.table's roll argument by slightly tweaking the number sold to date (subtracting a small amount turns the rolling match into a strict "greater than"):
FruitBought[, CumAvail:=cumsum(Qty), by=.(name)]
cumsoldTweak <- fruitSold[, .(Date, SoldToDate=cumsum(sold)-1e-2), .(name)]
cumsoldTweak[FruitBought, on=c("name", SoldToDate="CumAvail"), roll=-Inf,
.(name, Date=i.Date, Qty, LastSoldOut=Date)]
Name Date Qty LastSoldOut
1: Apple 20180101 15 20180107
2: Apple 20180105 20 NA
3: Banana 20180102 18 20180109
4: Banana 20180109 14 NA
data:
library(data.table) #data.table 1.11.4 Latest news: http://r-datatable.com
#convert the question's tables to data.tables by reference
setDT(FruitBought)
#order is important before doing cumsum
setorder(FruitBought, name, Date)
setDT(fruitSold)
#order is important before doing cumsum
setorder(fruitSold, Date, name)
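For readers without data.table, the same cumulative idea can be sketched in base R. This is only an illustration under a few assumptions, not the method of the answer above: the data frames `bought` and `sold` and the helper `first_exceed` are hypothetical names, the data is retyped from the question, and rows are assumed to be in date order within each fruit.

```r
# Data retyped from the question (bought/sold are illustrative names).
bought <- data.frame(
  name = c("Apple", "Apple", "Banana", "Banana"),
  Date = c(20180101L, 20180105L, 20180102L, 20180109L),
  Qty  = c(15, 20, 18, 14)
)
sold <- rbind(
  data.frame(name = "Apple",  Date = 20180101:20180110,
             sold = c(5, 3, 1, 2, 1, 2, 2, 0, 2, 3)),
  data.frame(name = "Banana", Date = 20180102:20180110,
             sold = c(3, 4, 2, 2, 3, 1, 3, 1, 1))
)

# Running totals per fruit; rows are already in date order within each fruit.
bought$CumAvail <- ave(bought$Qty, bought$name, FUN = cumsum)
sold$SoldToDate <- ave(sold$sold, sold$name, FUN = cumsum)

# Hypothetical helper: first sale date, on or after the purchase date,
# where cumulative sales exceed the cumulative stock bought so far.
first_exceed <- function(fruit, buy_date, cum_avail) {
  hit <- sold$Date[sold$name == fruit & sold$Date >= buy_date &
                   sold$SoldToDate > cum_avail]
  if (length(hit) == 0) NA_integer_ else min(hit)
}
bought$LastSoldOut <- mapply(first_exceed, bought$name, bought$Date,
                             bought$CumAvail, USE.NAMES = FALSE)
print(bought[, c("name", "Date", "Qty", "LastSoldOut")])
```

The row-by-row `mapply` is fine for a handful of purchases; for large tables the join-based versions above are likely the better fit.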
You can do something like this to get the desired output:
FruitBought <- tribble(
~name, ~Date, ~Qty,~id,
"Apple", 20180101, 15,1,
"Apple", 20180105, 20,2,
"Banana", 20180102, 18,1,
"Banana", 20180109, 14,2
)
fruitSold <- tribble(
~Date, ~name, ~sold,~id,
20180101, 'Apple', 5,1,
20180102, 'Apple', 3,1,
20180102, 'Banana', 3,1,
20180103, 'Apple', 1,1,
20180103, 'Banana', 4,1,
20180104, 'Apple', 2,1,
20180104, 'Banana', 2,1,
20180105, 'Apple', 1,1,
20180105, 'Banana', 2,1,
20180106, 'Apple', 2,1,
20180106, 'Banana', 3,1,
20180107, 'Apple', 2,1,
20180107, 'Banana', 1,1,
20180108, 'Apple', 0,2,
20180108, 'Banana', 3,1,
20180109, 'Apple', 2,2,
20180109, 'Banana', 1,1,
20180110, 'Apple', 3,2,
20180110, 'Banana', 1,1
)
library(dplyr)
# parse the numeric yyyymmdd values as dates
fruitSold$Date <- lubridate::ymd(fruitSold$Date)
FruitBought$Date <- lubridate::ymd(FruitBought$Date)
# last sale date within each (name, id) group
sold <- as.data.frame(fruitSold %>% group_by(name, id) %>% summarise(last_date = max(Date)))
final <- left_join(FruitBought, sold, by = c("name", "id"))
In this dataset DF, we have 4 names and 4 professions.
DF <- tribble(
~names, ~princess, ~singer, ~astronaut, ~painter,
"diana", 4, 1, 2, 3,
"shakira", 2, 1, 3, 4,
"armstrong", 3, 4, 1, 2,
"picasso", 1, 3, 1, 4
)
Assume that the cell values are some measure of each person's fit for a profession. So, for instance, Diana has her highest value under princess (correctly), but Shakira has her highest value under painter (incorrectly).
I want to create two columns called "Compatible" and "Incompatible". For Diana, the program should pick the value 4, since it is under her correct profession (princess), and assign it to "Compatible"; "Incompatible" should hold the average of the other 3 values. For Shakira, it should pick the value 1 from her correct profession (singer) and assign it to "Compatible", and average the other values for "Incompatible". Similarly for the other names.
So the output will be like this:
DF1 <- tribble(
~names, ~princess, ~singer, ~astronaut, ~painter,~Compatible,~Incompatible,
"diana", 4, 1, 2, 3, 4, 2,
"shakira", 2, 1, 3, 4, 1, 3,
"armstrong", 3, 4, 1, 2, 1, 3,
"picasso", 1, 3, 1, 4, 4, 1.66
)
Here is the dataset which shows the correct names and professions:
DF3<- tribble(
~names, ~professions,
"diana", "princess",
"shakira", "singer",
"armstrong", "astronaut",
"picasso", "painter"
)
library(dplyr)
library(tidyr)
DF1[1:5] %>%
pivot_longer(-names) %>%
left_join(DF3, 'names') %>%
group_by(names, name = if_else(name == professions, 'compatible', 'incompatible')) %>%
summarise(profession = first(professions), value = mean(value), .groups = 'drop') %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 4 x 4
names profession compatible incompatible
<chr> <chr> <dbl> <dbl>
1 armstrong astronaut 1 3
2 diana princess 4 2
3 picasso painter 4 1.67
4 shakira singer 1 3
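The same result can also be reached without reshaping, by indexing a score matrix with (row, column) pairs. The following is a minimal base R sketch, assuming every name has exactly one matching profession column; the intermediate names `scores`, `prof` and `idx` are only illustrative, and the data is retyped from the question.

```r
# Data retyped from the question.
DF <- data.frame(
  names     = c("diana", "shakira", "armstrong", "picasso"),
  princess  = c(4, 2, 3, 1),
  singer    = c(1, 1, 4, 3),
  astronaut = c(2, 3, 1, 1),
  painter   = c(3, 4, 2, 4)
)
DF3 <- data.frame(
  names       = c("diana", "shakira", "armstrong", "picasso"),
  professions = c("princess", "singer", "astronaut", "painter")
)

scores <- as.matrix(DF[, -1])                       # numeric part only
prof   <- DF3$professions[match(DF$names, DF3$names)]  # correct profession per row

# Two-column index matrix: (row number, column of the correct profession).
idx <- cbind(seq_len(nrow(scores)), match(prof, colnames(scores)))

DF$Compatible   <- scores[idx]
# Mean of the remaining columns: (row total - compatible value) / (k - 1).
DF$Incompatible <- (rowSums(scores) - DF$Compatible) / (ncol(scores) - 1)
print(DF)
```

Unlike the pivoted version, this keeps the rows in their original order.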
I want to count how often each pairwise combination of unique elements in column c of data frame df co-occurs on the elements of column a, with the added condition that a co-occurrence is only counted if the respective values in column b are unequal, i.e., conditional on a non-match in column b.
a <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
b <- c(1,1,2,2,2,1,1,2,2,3,3,3,3,1,1,1,2,2,2,4)
c <- c(1,2,1,2,3,2,3,1,2,1,1,2,3,1,2,1,1,2,4,1)
df <- as.data.frame(cbind(a,b,c))
Without considering column b, I could do the following to get, for each pair of elements of column c, the number of elements of a on which they co-occur:
df <- unique(df[,c(1,3)])
df <- merge(df, df, by = "a")
df$count <- 1
df <- aggregate(count ~ ., df[, c(2:4)], sum)
df <- df[df$c.x != df$c.y,]
With the additional condition of a non-match in b, there is only one difference: elements 2 and 4 of column c both co-occur on element 4 of column a, but have the same value in b and should therefore not be counted to end up with:
c.x <- c(2,3,4,1,3,1,2,1)
c.y <- c(1,1,1,2,2,3,3,4)
count <- c(4,3,1,4,3,3,3,1)
result <- as.data.frame(cbind(c.x,c.y,count))
As the original data set is large (> 1,000,000 observations), I welcome fast solutions, i.e., without using loops or merges. Usually, I create co-occurrence matrices from three-column data frames using sparseMatrix()
I'm not sure from your description if this is what you had in mind, nor how fast this would turn out to be, but here is an approach with purrr:
library(purrr)
library(dplyr)
split(df, df$c) %>%
combn(2, simplify = F) %>%
set_names(map_chr(., ~ paste(names(.x), collapse = "_"))) %>%
# element-wise & is needed here; && only compares single values
map_int(~ merge(.x[[1]], .x[[2]], by = NULL) %>%
dplyr::filter(a.x == a.y & b.x != b.y) %>%
nrow())
1_2 1_3 1_4 2_3 2_4 3_4
  9   4   2   3   0   0
# Data used:
df <- structure(list(a = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4), b = c(1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 1, 1, 1, 2, 2, 2, 4), c = c(1, 2, 1, 2, 3, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2, 4, 1)), class = "data.frame", row.names = c(NA, -20L))
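To pin down the counting rule from the question (each element of a counts at most once per pair of c values, and only if some pair of rows has unequal b), here is a small base R sketch. It is not a performance answer: it uses `merge`, which the question wants to avoid for large data, and is meant only as a reference to check faster solutions against. The names `u` and `pairs` are illustrative.

```r
df <- data.frame(
  a = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
  b = c(1,1,2,2,2,1,1,2,2,3,3,3,3,1,1,1,2,2,2,4),
  c = c(1,2,1,2,3,2,3,1,2,1,1,2,3,1,2,1,1,2,4,1)
)

u     <- unique(df)                 # drop duplicate (a, b, c) rows
pairs <- merge(u, u, by = "a")      # all row pairs that share the same a

# Keep pairs of different c values whose b values do not match.
pairs <- pairs[pairs$c.x != pairs$c.y & pairs$b.x != pairs$b.y, ]

# Count each element of a at most once per (c.x, c.y) pair.
pairs <- unique(pairs[, c("a", "c.x", "c.y")])
pairs$count <- 1
result <- aggregate(count ~ c.x + c.y, pairs, sum)
print(result)
```

On the sample data this drops the (2, 4) pair, since both values share b = 2 on a = 4.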
Is there a way to find a datetime difference grouped by a column in Data Explorer (Kusto)? I would like to find out the total time spent by each traveler in Spain.
A traveler is considered to be in a country from the time they arrive in that country till the time they arrive in their next destination. Here are the edge cases:
If TripComplete == 'No' and that country is the last visited destination, the trip is ongoing and the time spent should be the difference between now() and the entry time.
If TripComplete == 'Yes' and that country is the last visited destination, it is considered the end of the trip.
For Spain, say, the time spent should be the time the traveler spent in Spain on previous trips plus the time since their last entry into a city in Spain (now() - EntryTime).
Here is the expected result.
TravellerId result
1 05:00:00 [Madrid to Barcelona + Barcelona to London]
2 00:00:00 [Trip complete]
3 1.00:00:00 [now() - Malaga EntryTime]
4 05:00:00 [now() - Malaga EntryTime]
5 2:00:00 [Malaga to London]
6 1.16:00:00 [Madrid to Barcelona + (now() - Barcelona EntryTime)]
7 11:00:00 [Madrid to London + Barcelona to Beiging]
set query_now = datetime(2020-02-04 5:00:00);
datatable(TravellerId:int, Country:string, City:string, TripComplete:string, EntryTime: datetime)
[
1, 'China', 'Beiging', 'Yes', datetime(2020-02-02 12:00:00),
1, 'Spain', 'Madrid', 'Yes',datetime(2020-02-02 13:00:00),
1, 'Spain', 'Barcelona', 'Yes',datetime(2020-02-02 15:00:00),
1, 'UK', 'London', 'Yes', datetime(2020-02-02 18:00:00),
2, 'Spain', 'Malaga', 'Yes', datetime(2020-02-03 5:00:00),
3, 'Spain', 'Malaga', 'No', datetime(2020-02-03 5:00:00),
4, 'China', 'Beiging', 'No', datetime(2020-02-03 5:00:00),
4, 'Spain', 'Malaga', 'No', datetime(2020-02-04 00:00:00),
5, 'China', 'Beiging', 'No', datetime(2020-02-01 5:00:00),
5, 'Spain', 'Malaga', 'No', datetime(2020-02-02 5:00:00),
5, 'UK', 'London', 'No', datetime(2020-02-02 7:00:00),
6, 'China', 'Beiging', 'No', datetime(2020-02-02 12:00:00),
6, 'Spain', 'Madrid', 'No',datetime(2020-02-02 13:00:00),
6, 'Spain', 'Barcelona', 'No',datetime(2020-02-02 14:00:00),
7, 'Spain', 'Madrid', 'Yes',datetime(2020-02-02 13:00:00),
7, 'UK', 'London', 'Yes', datetime(2020-02-02 18:00:00),
7, 'Spain', 'Barcelona', 'Yes',datetime(2020-02-03 15:00:00),
7, 'China', 'Beiging', 'Yes', datetime(2020-02-03 21:00:00),
]
| order by TravellerId asc, TripComplete asc
//Incorrect because the next() calculation should be limited to the same traveler.
//Should be something like: if TripComplete == 'No' then nextEntry = next(EntryTime, 1, now()) else nextEntry = next(EntryTime, 1, EntryTime)
| extend nextEntry = next(EntryTime, 1, now())
| extend diffNext = nextEntry - EntryTime
| where Country == "Spain"
| summarize TimeSpentInSpain = sum(diffNext) by TravellerId
You could try something along the following lines:
set query_now = datetime(2020-02-04 5:00:00);
datatable(TravellerId:int, Country:string, City:string, TripComplete:string, EntryTime: datetime)
[
1, 'China', 'Beiging', 'Yes', datetime(2020-02-02 12:00:00),
1, 'Spain', 'Madrid', 'Yes',datetime(2020-02-02 13:00:00),
1, 'Spain', 'Barcelona', 'Yes',datetime(2020-02-02 14:00:00),
1, 'UK', 'London', 'Yes', datetime(2020-02-02 15:00:00),
2, 'Spain', 'Malaga', 'No', datetime(2020-02-03 5:00:00),
]
| order by TravellerId asc, EntryTime asc
| extend diff = EntryTime - prev(EntryTime)
| where Country == "Spain"
| summarize result = sumif(diff, TripComplete == "Yes") + sumif(now() - EntryTime, TripComplete != "Yes") by TravellerId
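As a cross-check of the rules against the expected table, the per-traveler logic can be sketched outside Kusto. The block below is written in R purely as an illustration and is not a Kusto solution. It assumes the reading implied by the expected results: the last row of a traveler counts up to "now" when TripComplete is 'No' and counts as zero when it is 'Yes'. The names `trips`, `now`, `is_last` and `next_entry` are hypothetical, and `now` is fixed to the question's query_now value.

```r
now <- as.POSIXct("2020-02-04 05:00", format = "%Y-%m-%d %H:%M", tz = "UTC")
trips <- data.frame(
  TravellerId  = c(1,1,1,1, 2, 3, 4,4, 5,5,5, 6,6,6, 7,7,7,7),
  Country      = c("China","Spain","Spain","UK", "Spain", "Spain",
                   "China","Spain", "China","Spain","UK",
                   "China","Spain","Spain", "Spain","UK","Spain","China"),
  TripComplete = c(rep("Yes", 5), rep("No", 9), rep("Yes", 4)),
  EntryTime    = as.POSIXct(c(
    "2020-02-02 12:00", "2020-02-02 13:00", "2020-02-02 15:00", "2020-02-02 18:00",
    "2020-02-03 05:00", "2020-02-03 05:00",
    "2020-02-03 05:00", "2020-02-04 00:00",
    "2020-02-01 05:00", "2020-02-02 05:00", "2020-02-02 07:00",
    "2020-02-02 12:00", "2020-02-02 13:00", "2020-02-02 14:00",
    "2020-02-02 13:00", "2020-02-02 18:00", "2020-02-03 15:00", "2020-02-03 21:00"),
    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Order within traveler so "next row" means "next destination".
trips <- trips[order(trips$TravellerId, trips$EntryTime), ]
n <- nrow(trips)
is_last <- c(trips$TravellerId[-1] != trips$TravellerId[-n], TRUE)

# Default: the next row's entry time; then patch each traveler's last row.
next_entry <- c(trips$EntryTime[-1], now)
ongoing  <- is_last & trips$TripComplete == "No"
finished <- is_last & trips$TripComplete == "Yes"
next_entry[ongoing]  <- now                        # still there: count up to now
next_entry[finished] <- trips$EntryTime[finished]  # trip over: zero duration

trips$hours <- as.numeric(difftime(next_entry, trips$EntryTime, units = "hours"))
result <- aggregate(hours ~ TravellerId, trips[trips$Country == "Spain", ], sum)
print(result)
```

The key point, which carries over to any Kusto version, is that the "next entry" lookup has to stop at the traveler boundary rather than reading the following traveler's first row.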
I have a set of panel data similar to:
city <- c("ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR")
week <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5)
df <- as.data.frame(cbind(city, week))
df$week <- as.numeric(df$week)
df$x <- c(6, 3, 9, 12, 4, 3, 7, 8, 2, 12, 15, 6, 3, 9, 0, 14, 18, 2, 21, 15, 17, 9, 10, 1, 22)
I would like to create a new variable, df$y, that sums df$x for each city, and for each week, prior to the week currently being observed. So, for example, df$y[25] should equal 31 because sum(df[df$city == "CAR" & df$week < 5, 3]) equals 31.
My question is, how can I write this in a function to do this automatically?
To use sum(df[df$city == "CAR" & df$week < 5, 3]) for each team and week combination would be tedious. My natural inclination is to write something like df$y <- sum(df[df$city == df$city & df$week < df$week, 3]), but that doesn't make sense. I'm new to R and don't fully understand functions; but, is that the best route for what I'm trying to do?
Thanks for your help!
One option with dplyr
library(dplyr)
res <- df %>%
group_by(city) %>%
mutate(y = cumsum(lag(x, default = 0)))
tail(res, 1) # show the last row (CAR, week 5)
# A tibble: 1 x 4
# Groups: city [1]
# city week x y
# <fctr> <dbl> <dbl> <dbl>
#1 CAR 5 22 31
One option with data.table
setDT(df)[, y := c(0, cumsum(x[-length(x)])), by = 'city']
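If you would rather stay in base R, `ave()` can apply the same "running total minus the current value" idea per city. This is a minimal sketch that assumes the rows are already sorted by week within each city, as they are in the sample data.

```r
df <- data.frame(
  city = rep(c("ARI", "ATL", "BAL", "BUF", "CAR"), times = 5),
  week = rep(1:5, each = 5),
  x    = c(6, 3, 9, 12, 4, 3, 7, 8, 2, 12, 15, 6, 3, 9, 0,
           14, 18, 2, 21, 15, 17, 9, 10, 1, 22)
)

# Within each city: cumulative sum up to and including this week,
# minus this week's own value, i.e. the sum over all prior weeks.
df$y <- ave(df$x, df$city, FUN = function(v) cumsum(v) - v)

df[df$city == "CAR", ]
```

If the data might not be sorted, ordering first with `df <- df[order(df$city, df$week), ]` keeps the running sum meaningful.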
I have a data frame:
A<- c(NA, 1, 2, NA, 3, NA)
R<- c(2, 1, 2, 1, NA, 1)
C<- c(rep ("B",3), rep ("D", 3))
data1<-data.frame (A,R,C)
And I want to merge columns A and R to get a data frame like data2:
AR<- c(2, 1, 2, 1, 3, 1)
C<- c(rep ("B",3), rep ("D", 3))
data2<-data.frame (AR,C)
Do you know how I can do that?
You might want to consider what happens if "A" and "R" have different values, but this should work:
data2 <- with(data1, data.frame(AR=ifelse(is.na(A), R, A), C=C))
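On the caveat about differing values: one way to find out whether it matters for your data is to flag rows where both columns are filled in but disagree before merging. A small sketch, where `conflict` is just an illustrative name and preferring A over R in a conflict is an assumption carried over from the `ifelse` above:

```r
data1 <- data.frame(
  A = c(NA, 1, 2, NA, 3, NA),
  R = c(2, 1, 2, 1, NA, 1),
  C = c(rep("B", 3), rep("D", 3))
)

# TRUE only where both A and R are present and hold different values.
conflict <- !is.na(data1$A) & !is.na(data1$R) & data1$A != data1$R
if (any(conflict)) {
  warning("A and R disagree in rows: ", paste(which(conflict), collapse = ", "))
}

# Take A where available, otherwise fall back to R.
data2 <- data.frame(AR = ifelse(is.na(data1$A), data1$R, data1$A), C = data1$C)
print(data2)
```

With the sample data there are no conflicts, so no warning is raised.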