I'm stuck trying to find a relatively simple way to count occurrences within a date range by group using R. I get the idea there has to be an easier way than what I'm trying.
I have over 6,000 groups; each group contains anywhere from 1 to 100 IDs, each with a start date and an end date falling anywhere from Jan 1, 1990 to today. I want to make a data frame with one group per column and one day per row, counting the number of IDs active per day from April 1, 2013 until March 31, 2018. For obvious reasons, using COUNTIFS in Excel will not cut it.
I was trying to use this question as a starting point, as such:
df1 <- data.frame(group = c(1, 1, 2, 3, 3),
                  id = c(1, 2, 1, 1, 2),
                  startdate = c("2016-01-01", "2016-04-04", "2016-03-02", "2016-08-01", "2016-04-01"),
                  enddate = c("2016-04-04", "2999-01-01", "2016-05-02", "2016-08-05", "2999-01-01"))
report <- data.frame(date = seq(from = as.Date("2016-04-01"), by = "1 day", length.out = 7))
report <- cbind(report, matrix(data = NA, nrow = 7, ncol = 3))
names(report) <- c('date', as.vector(unique(df1$group)))
daily <- function(i, ...) {
  report[, i + 1] <- sapply(report$date, function(x)
    sum(as.Date(df1$startdate) < as.Date(x) &
        as.Date(df1$enddate) > as.Date(x) &
        df1$group == unique(df1$group)[i]))
}
for (i in unique(df1$group))
  daily(i)
However, this doesn't seem to do anything (nor does it throw errors). Is there an easier way to do this? Am I way off base? Any help is appreciated for this non-programmer!
Additional help requested: I'm trying to modify Jaap's code in the answer below to include group open and close dates, so that the data table displays NA when the group is not active.
Example data:
df2 <- data.frame(group = c(1, 1, 2, 3, 3),
                  groupopendate = c("2016-04-02", "2016-04-02", "2016-04-01", "2016-04-02", "2016-04-02"),
                  groupclosedate = c("2016-04-08", "2016-04-08", "2016-04-10", "2016-04-09", "2016-04-09"),
                  id = c(1, 2, 1, 1, 2),
                  startdate = c("2016-04-02", "2016-04-04", "2016-04-03", "2016-04-02", "2016-04-05"),
                  enddate = c("2016-04-04", "2016-04-06", "2016-04-10", "2016-04-08", "2016-04-08"))
Jaap's solution gives me this:
active grp1 grp2 grp3
1: 2016-04-02 1 0 1
2: 2016-04-03 1 1 1
3: 2016-04-04 1 1 1
4: 2016-04-05 1 1 2
5: 2016-04-06 0 1 2
6: 2016-04-07 0 1 2
However, what I want is this:
active grp1 grp2 grp3
1: 2016-04-01 NA 0 NA
2: 2016-04-02 1 0 1
3: 2016-04-03 1 1 1
4: 2016-04-04 1 1 1
5: 2016-04-05 1 1 1
6: 2016-04-06 1 1 2
7: 2016-04-07 0 1 2
8: 2016-04-08 NA 1 0
9: 2016-04-09 NA 1 NA
10: 2016-04-10 NA NA NA
Any help is appreciated!
A possible alternative solution using data.table:
# load the package & convert 'df1' to a data.table
library(data.table)
setDT(df1)
# convert the date columns to Date format
# (not needed if they already are dates)
df1[, `:=` (startdate = as.Date(startdate), enddate = as.Date(enddate))]
# create a new data.table with the 'active' days
DT <- data.table(active = seq(from = as.Date("2016-04-01"), by = "day", length.out = 7))
# use a join and dcast to get the desired result
DT[df1
, on = .(active > startdate, active < enddate)
, allow.cartesian = TRUE
, nomatch = 0
, .(active = x.active, group, id)
][, dcast(.SD, active ~ paste0("grp",group), value.var = "id", fun = length)]
which gives:
active grp1 grp2 grp3
1: 2016-04-01 1 1 0
2: 2016-04-02 1 1 1
3: 2016-04-03 1 1 1
4: 2016-04-04 0 1 1
5: 2016-04-05 1 1 1
6: 2016-04-06 1 1 1
7: 2016-04-07 1 1 1
NOTE: I've used paste0("grp",group) instead of just group in the dcast step as it leads to better columnnames (it is better not to use just numeric values as columnnames)
With regard to your additional example, you could solve that as follows:
setDT(df2)
df2[, c(2:3,5:6) := lapply(.SD, as.Date), .SDcols = c(2:3,5:6)]
DT <- data.table(active = seq(from = min(df2$groupopendate),
to = max(df2$groupclosedate),
by = "day"))
df2new <- df2[, .(active = seq.Date(startdate, enddate, by = "day"))
, by = .(group, id)
][, .N, by = .(group, active)
][df2[, .(active = seq.Date(groupopendate[1], groupclosedate[.N] - 1, by = "day"))
, by = .(group)]
, on = .(group, active)
][is.na(N), N := 0
][, dcast(.SD, active ~ paste0("grp",group))]
nms <- setdiff(names(df2new), "active")
DT[df2new
, on = .(active)
, (nms) := mget(paste0("i.",nms))][]
which gives:
> DT
active grp1 grp2 grp3
1: 2016-04-01 NA 0 NA
2: 2016-04-02 1 0 1
3: 2016-04-03 1 1 1
4: 2016-04-04 2 1 1
5: 2016-04-05 1 1 2
6: 2016-04-06 1 1 2
7: 2016-04-07 0 1 2
8: 2016-04-08 NA 1 2
9: 2016-04-09 NA 1 NA
10: 2016-04-10 NA 1 NA
I've figured it out! As usual, as soon as you post a question, you figure out the answer. I was overcomplicating it with the function: the assignment inside daily() only modified a local copy of report, so the global report was never updated. Putting the sapply directly in the for loop works.
If anyone is interested:
grps <- unique(df1$group)
for (i in seq_along(grps)) {  # loop over group positions rather than group values
  report[, i + 1] <-
    sapply(report$date, function(x)
      sum(as.Date(df1$startdate) < as.Date(x) &
          as.Date(df1$enddate) > as.Date(x) &
          df1$group == grps[i]))
}
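For what it's worth, the function approach would also have worked if daily() had returned the counts instead of assigning them to a local copy of report; a hedged sketch of that variant:
daily <- function(i) {
  sapply(report$date, function(x)
    sum(as.Date(df1$startdate) < as.Date(x) &
        as.Date(df1$enddate) > as.Date(x) &
        df1$group == unique(df1$group)[i]))
}
for (i in seq_along(unique(df1$group)))
  report[, i + 1] <- daily(i)  # assign the returned column in the caller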
I have data with about 1,000 groups, and within each group the rows should be ordered from 1 to 100 (it can be any number up to 100).
As I was looking through the data, I found that some groups had bad orderings, i.e., the order would run up toward 100 and then suddenly a 24 would show up.
How can I delete all of these error rows?
As the before/after example showed (picture omitted), I would like to find all rows that don't follow the order within the group and just delete them.
Any help would be great!
lag gives the previous value, so order - lag(order) computes the difference between the current value and the previous one; the filter keeps only non-negative differences, i.e. rows where the current value is at least the previous value. The order == min(order) condition keeps the first row of each group, where lag returns NA. I keep the helper column diff so you can check the result, but you can drop it with %>% select(-diff).
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(diff = order - lag(order)) %>%
  filter(diff >= 0 | order == min(order))
# A tibble: 8 x 3
# Groups:   group [2]
  group order  diff
  <int> <int> <int>
1     1     1    NA
2     1     3     2
3     1     5     2
4     1    10     5
5     2     1    NA
6     2     4     3
7     2     4     0
8     2     8     4
Data
df1 <- read.table(text = "
group order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
", header = TRUE, stringsAsFactors = FALSE)
Assuming the order column increments by 1 every time, we can use ave and keep only the rows whose difference from the previous row within a group is 1.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one, then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9
Let's say I have the following simple data.frame:
ID value
1 1 3
2 2 4
3 1 5
4 3 3
My desired output is below: a value is added to cumsum, or, for an ID that has already appeared, cumsum is updated with that ID's latest value.
ID value cumsum
1 1 3 3
2 2 4 7
3 1 5 9
4 3 3 12
In row 3, the new value updates the cumsum (7 - 3 + 5 = 9). Row 4 adds its value to the cumsum because ID 3 was not used before (4 + 5 + 3 = 12).
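To make the update rule explicit, here is a plain-loop sketch of my own (not from the answers below): each row contributes its value minus that ID's previous value to a running total.
df <- data.frame(ID = c(1, 2, 1, 3), value = c(3, 4, 5, 3))
last <- list()        # last seen value per ID
total <- 0
df$cumsum <- NA_real_
for (i in seq_len(nrow(df))) {
  id <- as.character(df$ID[i])
  prev <- if (id %in% names(last)) last[[id]] else 0
  total <- total + df$value[i] - prev  # swap out the ID's previous contribution
  last[[id]] <- df$value[i]
  df$cumsum[i] <- total
}
df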
This produces the desired outcome for your example:
df<-read.table(header=T, text="ID value
1 1 3
2 2 4
3 1 5
4 3 3")
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(value = value - lag(value, default = 0L)) %>%
  ungroup %>%
  mutate(cumsum = cumsum(value))
# # A tibble: 4 x 3
# ID value cumsum
# <int> <int> <int>
# 1 1 3 3
# 2 2 4 7
# 3 1 2 9
# 4 3 3 12
I used data.table for the cumsum. Calculating the cumulative mean is a bit more tricky because the number of observations is not adjusted by just using cummean.
library(data.table)
library(zoo)  # for na.locf
dt = data.table(id = c(1, 2, 1, 3), value = c(3, 4, 5, 3))
dt[, tmp := value-shift(value, n = 1L, type = "lag", fill = 0), by = c("id")]
#CUMSUM
dt[, cumsum := cumsum(tmp)]
#CUMMEAN WITH UPDATED N
dt[value != tmp, skip := 1:.N]
dt[, skip := na.locf(skip, na.rm = FALSE)]
dt[is.na(skip), skip := 0]
dt[, cummean := cumsum(tmp)/(seq_along(tmp)-skip)]
Output is:
id value tmp cumsum skip cummean
1: 1 3 3 3 0 3.0
2: 2 4 4 7 0 3.5
3: 1 5 2 9 1 4.5
4: 3 3 3 12 1 4.0
Edit: Changed lag function to data.table's shift function.
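As an aside, recent data.table versions ship their own LOCF fill, which would avoid the zoo dependency here (a sketch, assuming data.table >= 1.12.4):
dt[, skip := nafill(skip, type = "locf")]  # forward-fill by reference
dt[is.na(skip), skip := 0]                 # zero out the leading NAs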
I need to fill in the missing values of the $Year sequence within each level of the $Country factor. The $Count column can just be padded out with 0s.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>%
  group_by(Country) %>%
  complete(Year = full_seq(Year, 1), fill = list(Count = 0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to @PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
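Indeed, a hedged tidyr sketch of the same fill-with-defaults idea, with a hypothetical second value column Other to show the multi-column case:
library(dplyr)
library(tidyr)
d <- data.frame(Country = c("A", "A", "A", "B", "B"),
                Year = c(1, 2, 4, 1, 3),
                Count = c(1, 1, 2, 1, 1))
d$Other <- d$Count * 2  # made-up extra column
d %>%
  group_by(Country) %>%
  complete(Year = full_seq(Year, 1), fill = list(Count = 0, Other = 0)) %>%
  ungroup()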
Another base R idea is to split on Country, use setdiff to find the missing values of seq(max(Year)), and rbind them to the original data frame; do.call then rbinds the list back into a single data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i) {
  x <- rbind(i, data.frame(Country = i$Country[1],
                           Year = setdiff(seq(max(i$Year)), i$Year),
                           Count = 0))
  x[with(x, order(Year)), ]
}), make.row.names = FALSE))
which gives:
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
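(To match the question's zero-padded Count, the NAs left by this join can be replaced afterwards; a small follow-up sketch of my own:)
res <- DT[setkey(DT[, .(Year = min(Year):max(Year)), by = Country], Country, V1)]
res[is.na(Count), Count := 0]
res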
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the returned named list, merges it onto the original (which adds the desired rows), and finally fills in the missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
This question is a modification of a problem I posted here, where I have occurrences of a specific type on different days, but this time they are assigned to multiple users, for example:
df = data.frame(user_id = rep(1:2, each = 5),
                cancelled_order = rep(c(0, 1, 1, 0, 0), 2),
                order_date = as.Date(c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-03-23',
                                       '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21', '2015-03-26')))
user_id cancelled_order order_date
1 0 2015-01-28
1 1 2015-01-31
1 1 2015-02-08
1 0 2015-02-23
1 0 2015-03-23
2 0 2015-01-25
2 1 2015-01-28
2 1 2015-02-06
2 0 2015-02-21
2 0 2015-03-26
I'd like to calculate:
1) the number of cancelled orders that each customer is going to have in the next x days (e.g. 7, 14), excluding the current one, and
2) the number of cancelled orders that each customer had in the past x days (e.g. 7, 14), excluding the current one.
The desired output (solution) would look like this:
user_id cancelled_order order_date plus14 minus14
1 0 2015-01-28 2 0
1 1 2015-01-31 1 0
1 1 2015-02-08 0 1
1 0 2015-02-23 0 0
1 0 2015-03-23 0 0
2 0 2015-01-25 2 0
2 1 2015-01-28 1 0
2 1 2015-02-06 0 1
2 0 2015-02-21 0 0
2 0 2015-03-26 0 0
The solution that is a perfect fit for this purpose was presented by @joel.wilson using data.table:
library(data.table)
vec <- c(14, 30) # Specify desired ranges
setDT(df)[, paste0("x", vec) :=
lapply(vec, function(i) sum(df$cancelled_order[between(df$order_date,
order_date,
order_date + i, # this part can be changed to reflect the past date ranges
incbounds = FALSE)])),
by = order_date]
However, it does not take grouping by user_id into account. When I tried to modify the formula by adding this grouping as by = c("user_id", "order_date") or by = list(user_id, order_date), it did not work. It seems like something very basic; any hints on how to get around this detail?
Also, keep in mind that I'm after a solution that works, even if it is not based on the above code or data.table at all!
Thanks!
Here's one way:
library(data.table)
orderDT = with(df, data.table(id = user_id, completed = !cancelled_order, d = order_date))
vec = list(minus = 14L, plus = 14L)
orderDT[, c("dplus", "dminus") := .(
orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N
,
orderDT[!(completed)][orderDT[, .(id, d_minus = d - vec$minus, d_yest = d - 1L)], on=.(id, d >= d_minus, d <= d_yest), .N, by=.EACHI]$N
)]
id completed d dplus dminus
1: 1 TRUE 2015-01-28 2 0
2: 1 FALSE 2015-01-31 1 0
3: 1 FALSE 2015-02-08 0 1
4: 1 TRUE 2015-02-23 0 0
5: 1 TRUE 2015-03-23 0 0
6: 2 TRUE 2015-01-25 2 0
7: 2 FALSE 2015-01-28 1 0
8: 2 FALSE 2015-02-06 0 1
9: 2 TRUE 2015-02-21 0 0
10: 2 TRUE 2015-03-26 0 0
(I found OP's column names cumbersome and so shortened them.)
How it works
Each of the columns can be run on its own, like
orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N
And this can be broken down into steps by simplifying:
orderDT[!(completed)][
orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
on=.(id, d <= d_plus, d >= d_tom),
.N,
by=.EACHI]$N
# original version
orderDT[!(completed)][
orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
on=.(id, d <= d_plus, d >= d_tom),
.N,
by=.EACHI]
# don't extract the N column of counts
orderDT[!(completed)][
orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
on=.(id, d <= d_plus, d >= d_tom)]
# don't create the N column of counts
orderDT[!(completed)]
# don't do the join
orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)]
# see the second table used in the join
This uses a "non-equi" join, taking inequalities to define the date ranges. For more details, see the documentation page found by typing ?data.table.
I might have made this solution a bit complex:
library(dplyr)
library(tidyr)
vec <- c(7, 14)
reslist <- lapply(vec, function(x) {
  df %>%
    merge(df %>% rename(cancelled_order2 = cancelled_order, order_date2 = order_date)) %>%
    filter(abs(order_date - order_date2) <= x) %>%
    group_by(user_id, order_date) %>%
    arrange(order_date2) %>%
    mutate(cumcancel = cumsum(cancelled_order2)) %>%
    mutate(before = cumcancel - cancelled_order2,
           after  = max(cumcancel) - cumcancel) %>%
    filter(order_date == order_date2) %>%
    select(user_id, cancelled_order, order_date, before, after) %>%
    mutate(within = x)
})
do.call(rbind, reslist) %>%
  gather(key, value, -user_id, -cancelled_order, -order_date, -within) %>%
  mutate(col = paste0(key, "_", within)) %>%
  select(-within, -key) %>%
  spread(col, value) %>%
  arrange(user_id, order_date)
PS: I did spot a mistake in your output example (user_id 1, order_date 2015-02-23: minus14 should be 0, since there are 15 days between 02/08 and 02/23).
I recommend the runner package. Its runner function executes any R function within a sliding window.
To obtain the sum over the current 7-day and 14-day windows while excluding the current element, one can use sum(x[-length(x)]) for each window.
library(runner)
library(dplyr)
df %>%
  group_by(user_id) %>%
  mutate(
    minus_7 = runner(cancelled_order, k = 7, idx = order_date,
                     f = function(x) sum(x[-length(x)])),
    minus_14 = runner(cancelled_order, k = 14, idx = order_date,
                      f = function(x) sum(x[-length(x)])))
# A tibble: 10 x 5
# Groups:   user_id [2]
   user_id cancelled_order order_date minus_7 minus_14
     <int>           <dbl> <date>       <dbl>    <dbl>
 1       1               0 2015-01-28       0        0
 2       1               1 2015-01-31       0        0
 3       1               1 2015-02-08       0        1
 4       1               0 2015-02-23       0        0
 5       1               0 2015-03-23       0        0
 6       2               0 2015-01-25       0        0
 7       2               1 2015-01-28       0        0
 8       2               1 2015-02-06       0        1
 9       2               0 2015-02-21       0        0
10       2               0 2015-03-26       0        0
For future elements it's a bit tricky, because it's still a 7-day window but lagged by -6 days (i:(i+6) spans 7 days). In this case, the first element of each window (the current one) is excluded with sum(x[-1]).
df %>%
group_by(user_id) %>%
mutate(
plus_7 = runner(cancelled_order, k = 7, lag = -6, idx = order_date,
f = function(x) sum(x[-1])),
plus_14 = runner(cancelled_order, k = 14, lag = -13, idx = order_date,
f = function(x) sum(x[-1]))
)
# A tibble: 10 x 5
# Groups: user_id [2]
user_id cancelled_order order_date plus_7 plus_14
<int> <dbl> <date> <dbl> <dbl>
1 1 0 2015-01-28 1 2
2 1 1 2015-01-31 0 1
3 1 1 2015-02-08 0 0
4 1 0 2015-02-23 0 0
5 1 0 2015-03-23 0 0
6 2 0 2015-01-25 1 2
7 2 1 2015-01-28 0 1
8 2 1 2015-02-06 0 0
9 2 0 2015-02-21 0 0
10 2 0 2015-03-26 0 0
More information can be found in the package and function documentation.