Count number of occurrences in date range in R - r

I have a data frame with a number of accounts, their status, and the start and end time for each status. I would like to report the number of accounts in each status over a date range. The data looks like df below, and report shows the desired result. (The actual data contains more state values. N/A values are represented by a dummy date far in the future.)
df <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6),
number.open = c(2,2,2,1,1,1)
)
I have looked at options involving rowwise() and mutate from dplyr and foverlaps from data.table, but haven't been able to code it up so it works.
(See Checking if Date is Between two Dates in R)

We can use sapply to do this for us:
report$NumberOpen <-
sapply(report$date, function(x)
sum(as.Date(df1$startdate) < as.Date(x) &
as.Date(df1$enddate) > as.Date(x) &
df1$state == 'Open'))
# report
# date NumberOpen
# 1 2016-04-01 2
# 2 2016-04-02 2
# 3 2016-04-03 2
# 4 2016-04-04 1
# 5 2016-04-05 1
# 6 2016-04-06 1
Data:
df1 <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6)
)
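Since the question mentions that the real data has more state values, the same comparison can be run for every state at once. The following is a minimal sketch, not part of the original answer: it builds a logical matrix with outer() and sums it per state with rowsum(). The names active and counts are illustrative, and the strict comparisons are kept from the answer above, so an interval is not counted on its own start or end date.

```r
df1 <- data.frame(account = c(1, 1, 2, 3),
                  state = c("Open", "Closed", "Open", "Open"),
                  startdate = c("2016-01-01", "2016-04-04", "2016-03-02", "2016-08-01"),
                  enddate = c("2016-04-04", "2999-01-01", "2016-05-02", "2016-08-05"))
dates <- seq(from = as.Date("2016-04-01"), by = "1 day", length.out = 6)

# Work on day numbers so outer() compares plain numeric vectors
start <- as.numeric(as.Date(df1$startdate))
end   <- as.numeric(as.Date(df1$enddate))
day   <- as.numeric(dates)

# active[i, j] is TRUE when row i of df1 strictly contains dates[j]
active <- outer(start, day, "<") & outer(end, day, ">")

# Sum the rows of the matrix per state: one row per state, one column per date
counts <- rowsum(active + 0, df1$state)
colnames(counts) <- format(dates)
counts
```

The "Open" row of counts corresponds to the NumberOpen column computed with sapply above.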

Related

Count the number of rows within a certain time range based on each row in R (tidyverse)

I want to count the number of rows within a certain time range based on each row after grouping by id. For instance, let us say a 1-month window around each datetime entry in the column "cleaned_date".
head(data$cleaned_date)
[1] "2004-10-11 CDT" "2008-09-10 CDT" "2011-10-25 CDT" "2011-12-31 CST"
The dates are in POSIXct format.
For the first entry, I need to count the number of rows within the time from 2004-09-11 to 2004-11-11, for the second entry, count the number of rows within the time from 2008-08-10 to 2008-10-10, so on and so forth.
I used roughly the following code
data %>% group_by(id) %>% filter(cleaned_date %within% interval(cleaned_date - 24 * 60 * 60 * 30, cleaned_date + 24 * 60 * 60 * 30)) %>% mutate(counts = n())
But it does not seem to work and I got counts as an empty column. Any help would be appreciated, thanks!
A reproducible example can be the following:
The input is
cleaned_date id
1 2008-09-11 A
2 2008-09-10 B
3 2008-09-30 B
4 2011-10-25 A
5 2011-11-14 A
And I want the output to be
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
For the first entry, I want to count the rows in the window 2008-08-11 to 2008-10-11. The second entry falls inside that window, but because we group by "id" it does not count. For the second entry, the window is 2008-08-10 to 2008-10-10; rows 2 and 3 qualify, so the count is 2. For the third entry, the window is 2008-08-30 to 2008-10-30; rows 2 and 3 qualify again, and so on.
Note that the actual dataset I would like to operate on has millions of rows, so it might be more efficient to use tidyverse rather than base R.
Perhaps not the most elegant solution.
# input data. Dates as character vector
input = data.frame(
cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"),
id = c("A", "B", "B", "A", "A")
)
# function to create a date window n months around specified date
window <- function(x, n = 1){
x <- rep(as.POSIXlt(x),2)
x[1]$mon <- x[1]$mon - n
x[2]$mon <- x[2]$mon + n
return(format(seq(from = x[1], to = x[2], by = "day"), format="%Y-%m-%d"))
}
# find counts for each row
input$counts <- unlist(lapply(1:nrow(input), function(x){
length(which((input$cleaned_date %in% window(input$cleaned_date[x])) & input$id == input$id[x]))
}))
input
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
Edit for large datasets:
# dummy dataset with 1,000,000 rows
years <- c(2000:2020)
months <- c(1:12)
days <- c(1:20)
n <- 1000000
dates <- paste(sample(years, size = n, replace = T), sample(months, size = n, replace = T), sample(days, size = n, replace = T), sep = "-")
groups <- sample(c("A","B","C"), size = n, replace = T)
input <- data.frame(
cleaned_date = dates,
id = groups
)
input$cleaned_date <- format(as.POSIXlt(input$cleaned_date), format="%Y-%m-%d")
# optional, sort data by date for small boost in performance
input <- input[order(input$cleaned_date),]
counts <- NULL
#pb <- progress::progress_bar$new(total = length(unique(input$cleaned_date)))
t1 <- Sys.time()
# split up vectorization for each unique date.
for(date in unique(input$cleaned_date)){
#pb$tick()
w <- window(date)
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == date), function(x){
length(which(tmp$id == tmp$id[x]))
}))
counts <- c(counts, tmp_counts)
}
# add counts to dataset
input$counts <- counts
# optional, re-order data to original format
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 3.247204 mins
If you want to go faster, you can run the loop in parallel
library(foreach)
library(doParallel)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)
dates = unique(input$cleaned_date)
t1 <- Sys.time()
counts <- foreach(i=1:length(dates), .combine= "c") %dopar% {
w <- window(dates[i])
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == dates[i]), function(x){
length(which(tmp$id == tmp$id[x]))
}))
tmp_counts
}
stopCluster(cl)
input$counts <- counts
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 37.37211 secs
Note, I'm running this on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 and 16 GB of RAM.
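For millions of rows, an alternative that avoids building a window per date is to sort the dates within each id and count with findInterval(). This is a sketch under an assumption that differs from the answer above: the window is a fixed 30 days on either side (as in the question's own code) rather than a calendar month. The names half_window and count_sorted are made up for illustration.

```r
input <- data.frame(
  cleaned_date = as.Date(c("2008-09-11", "2008-09-10", "2008-09-30",
                           "2011-10-25", "2011-11-14")),
  id = c("A", "B", "B", "A", "A")
)

half_window <- 30  # days on either side of each date (assumption)

# For each date in d, count the dates in d that lie within +/- half_window days
count_sorted <- function(d) {
  x <- as.numeric(d)
  s <- sort(x)
  upper <- findInterval(x + half_window, s)                    # dates <= d + 30
  lower <- findInterval(x - half_window, s, left.open = TRUE)  # dates <  d - 30
  upper - lower
}

# Apply the counter separately within each id
input$counts <- integer(nrow(input))
for (g in unique(input$id)) {
  idx <- input$id == g
  input$counts[idx] <- count_sorted(input$cleaned_date[idx])
}
input
```

Each group costs one sort plus two binary searches per row, so the work grows roughly as n log n rather than with the number of unique dates times the table size.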
It is still hard to determine exactly what you're trying to accomplish, but this will at least get you counts for a specified date range:
df %>%
group_by(id) %>%
filter(cleaned_date >= "2008-08-11" & cleaned_date <= "2008-10-11") %>%
mutate(counts = n())
Will give us:
cleaned_date id counts
<date> <chr> <int>
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
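The fixed range above does not move with each row. A minimal base R sketch of the per-row window the question describes follows; shift_month and count_in_window are hypothetical helper names, and the one-month shift relies on seq() with by = "-1 month" or "1 month", which rolls over for days that do not exist in the target month (for example the 31st). It compares every pair of dates within an id, so it is meant to show the logic rather than to scale to millions of rows.

```r
input <- data.frame(
  cleaned_date = as.Date(c("2008-09-11", "2008-09-10", "2008-09-30",
                           "2011-10-25", "2011-11-14")),
  id = c("A", "B", "B", "A", "A")
)

# Shift a single date by n calendar months (n may be negative)
shift_month <- function(d, n) seq(d, by = paste(n, "month"), length.out = 2)[2]

# For each date in d, count the dates in d within one month on either side
count_in_window <- function(d) {
  vapply(seq_along(d), function(i) {
    lo <- shift_month(d[i], -1)
    hi <- shift_month(d[i], 1)
    sum(d >= lo & d <= hi)
  }, integer(1))
}

# Apply the counter separately within each id
input$counts <- integer(nrow(input))
for (g in unique(input$id)) {
  idx <- input$id == g
  input$counts[idx] <- count_in_window(input$cleaned_date[idx])
}
input
```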

How to add rows with time periods in between given time periods?

I have a data set with time periods, that may overlap, showing me if somebody was present (example_df). I want to get a data set that splits a large time period (from 2014-01-01 to 2014-10-31) into smaller time periods where somebody was present (present = 1) and time periods where nobody was present (present = 0).
The result should look like result_df
Example data frame
example_df <- data.frame(ID = 1,
start = c(as.Date("2014-01-01"), as.Date("2014-03-05"), as.Date("2014-06-13"), as.Date("2014-08-15")),
end = c(as.Date("2014-04-07"), as.Date("2014-04-12"), as.Date("2014-08-05"), as.Date("2014-10-02")),
present = 1)
Result should look like this
result_df <- data.frame(ID = 1,
start = c(as.Date("2014-01-01"), as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02")),
end = c(as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02"), as.Date("2014-10-31")),
present = c(1, 0, 1, 0, 1, 0))
I have no idea how to tackle this problem, as it requires splitting time periods or adding rows (or something else?). Any help is much appreciated!
I hope I can be helpful, as I have struggled with this as well.
As in IceCreamToucan's example, this assumes each person ID is handled independently. This approach uses dplyr to find overlapping date ranges and then flattens them; similar approaches have been described elsewhere on Stack Overflow. The end result contains the time ranges where the person is present.
library(tidyr)
library(dplyr)
pres <- example_df %>%
group_by(ID) %>%
arrange(start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = min(start), end = max(end), present = 1) %>%
select(-indx)
Then additional rows can be added for the time periods when nobody is present. For a given ID, this finds the gaps between each end date and the next start date. Finally, the result is ordered by ID and start date.
result <- pres
for (i in unique(pres$ID)) {
pres_i <- subset(pres, ID == i)
if (nrow(pres_i) > 1) {
adding <- data.frame(ID = i, start = pres_i$end[-nrow(pres_i)]+1, end = pres_i$start[-1]-1, present = 0)
adding <- adding[adding$start <= adding$end, ]
result <- bind_rows(result, adding)
}
}
result[order(result$ID, result$start), ]
# A tibble: 5 x 4
# Groups: ID [1]
ID start end present
<dbl> <date> <date> <dbl>
1 1 2014-01-01 2014-04-12 1
2 1 2014-04-13 2014-06-12 0
3 1 2014-06-13 2014-08-05 1
4 1 2014-08-06 2014-08-14 0
5 1 2014-08-15 2014-10-02 1
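The flattening step (start a new block whenever a start date lies after every earlier end date) can also be sketched in base R. This is an illustration for a single ID only; new_block, block and merged are names chosen here, not taken from the answer.

```r
example_df <- data.frame(
  ID = 1,
  start = as.Date(c("2014-01-01", "2014-03-05", "2014-06-13", "2014-08-15")),
  end   = as.Date(c("2014-04-07", "2014-04-12", "2014-08-05", "2014-10-02")),
  present = 1
)

# Sort by start date and work on day numbers
o <- example_df[order(example_df$start), ]
s <- as.numeric(o$start)
e <- as.numeric(o$end)

# A new block begins when a start date lies after every earlier end date
new_block <- c(TRUE, s[-1] > cummax(e)[-length(e)])
block <- cumsum(new_block)

# Collapse each block to its earliest start and latest end
merged <- data.frame(
  start = as.Date(as.vector(tapply(s, block, min)), origin = "1970-01-01"),
  end   = as.Date(as.vector(tapply(e, block, max)), origin = "1970-01-01")
)
merged
```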
Assuming you want to do it separately for each ID, you can create a data table with all dates for which someone was present, and join that with a table of all dates over that time period. The result is not exactly the same, because the present and not-present periods don't overlap.
library(data.table)
setDT(example_df)
example_df[, {
pres <- unique(unlist(Map(`:`, start, end)))
class(pres) <- 'Date'
all <- min(pres):max(pres)
class(all) <- 'Date'
pres <- data.table(day = pres)
all <- data.table(day = all)
out.full <- pres[all, on = .(day), .(day = i.day, present = +!is.na(x.day))]
out.full[, .(start = min(day), end = max(day)),
by = .(present, rid = rleid(present))][, -'rid']
}, by = ID]
# ID present start end
# 1: 1 1 2014-01-01 2014-04-12
# 2: 1 0 2014-04-13 2014-06-12
# 3: 1 1 2014-06-13 2014-08-05
# 4: 1 0 2014-08-06 2014-08-14
# 5: 1 1 2014-08-15 2014-10-02
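Both answers stop at the last end date and use non-overlapping boundaries, whereas result_df in the question also has a closing absent period up to 2014-10-31 and shares boundary dates between rows. Below is a sketch that reproduces that layout for a single ID, starting from the merged present periods shown above; period_end, gaps and full are illustrative names.

```r
# Merged "present" periods, typed in from the output above
pres <- data.frame(
  ID = 1,
  start = as.Date(c("2014-01-01", "2014-06-13", "2014-08-15")),
  end   = as.Date(c("2014-04-12", "2014-08-05", "2014-10-02")),
  present = 1
)
period_end <- as.Date("2014-10-31")  # end of the overall period in the question

# Each gap runs from one end date to the next start date (or to period_end)
gaps <- data.frame(
  ID = 1,
  start = pres$end,
  end = c(pres$start[-1], period_end),
  present = 0
)
gaps <- gaps[gaps$start < gaps$end, ]  # drop empty gaps

# Combine present and absent periods and order them chronologically
full <- rbind(pres, gaps)
full <- full[order(full$start), ]
rownames(full) <- NULL
full
```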

Expanding R Matrix on Date

I have the following R matrix:
Date MyVal
2016 1
2017 2
2018 3
....
2026 10
What I want to do is "blow it up" so that it goes like this (where monthly values are linearly interpolated):
Date MyVal
01/01/2016 1
02/01/2016 ..
....
01/01/2017 2
....
01/01/2026 10
I realize I can easily generate the sequence using:
DateVec <- seq(as.Date(paste(minYear,"/01/01", sep = "")), as.Date(paste(maxYear, "/01/01", sep = "")), by = "month")
And I can use that to make a large matrix and then fill things in using a for loop over DateVec, but I wonder if there's a more elegant R way to do this?
You can use stats::approx:
library(stats)
ipc <- approx(df$Date, df$MyVal, xout = DateVec,
rule = 1, method = "linear", ties = mean)
You probably need to first convert the dates in your original data frame so that they include a month and day and are in as.POSIXct or as.Date format.
Based on what you provided, this works:
#Make the reference data-frame for interpolation:
DateVec <- seq(min(df$Date, na.rm=T),
max(df$Date, na.rm=T), by = "month")
#Interpolation:
intrpltd_df <- approx(df$Date, df$MyVal, xout = DateVec,
rule = 1, method = "linear", ties = mean)
# x y
# 1 2016-01-01 1.000000
# 2 2016-02-01 1.084699
# 3 2016-03-01 1.163934
# 4 2016-04-01 1.248634
# 5 2016-05-01 1.330601
# 6 2016-06-01 1.415301
Data:
#reproducing the data-frame:
Date <- seq(2016,2026)
MyVal <- seq(1:11)
Date <- data.frame(as.Date(paste0(Date,"/01/01"))) #yyyy-mm-dd format
df <- cbind(Date, MyVal)
df <- as.data.frame(df)
colnames(df) <- c ("Date", "MyVal") #Changing Column Names
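approx() returns a list with components x and y, so a small wrapper is needed to get back a two-column table. A self-contained sketch, assuming yearly values on 1 January as in the reproduced data; yearly and monthly are names chosen here, and the dates are passed as day numbers to keep the interpolation purely numeric.

```r
# Yearly values on 1 January, as in the reproduced data above
yearly <- data.frame(
  Date = as.Date(paste0(2016:2026, "-01-01")),
  MyVal = 1:11
)

# One date per month between the first and last year
DateVec <- seq(min(yearly$Date), max(yearly$Date), by = "month")

# Interpolate on day numbers and keep only the interpolated values
monthly <- data.frame(
  Date = DateVec,
  MyVal = approx(as.numeric(yearly$Date), yearly$MyVal,
                 xout = as.numeric(DateVec), method = "linear")$y
)
head(monthly)
```

Because the interpolation is linear in days, months of different lengths get slightly different increments, which matches the uneven steps in the output above.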

Alternatives to MAPPLY in R

I have a data frame with the following:
1) Store
2) DayOfWeek
3) Date
4) Sales
5) Customers
6) Open
7) Promo
8) StateHoliday
9) SchoolHoliday
10) StoreType
11) Assortment
12) CompetitionDistance
13) CompetitionOpenSinceMonth
14) CompetitionOpenSinceYear
15) Promo2
16) Promo2SinceWeek
17) Promo2SinceYear
18) PromoInterval
19) CompanyDistanceBin
20) CompetitionOpenSinceDate
21) DaysSinceCompetionOpen
I am trying to calculate the Average Sales for the Previous Quarter based on the date (basically date - 3 months). But, I need to also subset based on DayOfWeek and Promo. I have written a function and am using mapply.
quarter.store.sales.func <- function(storeId, storeDate, dayofweekvar, promotion)
{
storeDate = as.Date(storeDate,"%Y-%m-%d")
EndDate = ymd(as.Date(storeDate)) + ddays(-1)
EndDate = as.Date(EndDate)
StartDate = ymd(storeDate + months(-3))
StartDate = as.Date(StartDate)
quarterStoresales <- subset(saleswithstore, Date >= StartDate & Date <= EndDate & Store == storeId & DayOfWeek == dayofweekvar & Promo == promotion)
quarterSales = 0
salesDf <- ddply(quarterStoresales,.(Store),summarize,avgSales=mean(Sales))
if (nrow(salesDf)>0)
quarterSales = as.numeric(round(salesDf$avgSales,digits=0))
return(quarterSales)
}
saleswithstore$QuarterSales <- mapply(quarter.store.sales.func, saleswithstore$Store, saleswithstore$Date, saleswithstore$DayOfWeek, saleswithstore$Promo)
head(exampleset)
Store DayOfWeek Date Sales Promo
186 1 3 2013-06-05 5012 1
296 1 3 2013-04-10 4903 1
337 1 3 2013-05-29 5784 1
425 1 3 2013-05-08 5230 0
449 1 3 2013-04-03 4625 0
477 1 3 2013-03-27 6660 1
saleswithstore is a data frame with 1,000,000 rows, so this solution is not workable: it performs badly and takes forever. Is there a better, more efficient way to take a specific subset of a data frame like this and then compute an average, as I am trying to do here?
I am open to any suggestions. I admittedly am new to R.
@maubin0316, your intuition in the comment is right: you can just group by the rest of the variables. I put together this example using data.table:
library(data.table)
set.seed(343)
# Create sample data
dt <- data.table('Store' = sample(1:10, 100, replace=T),
'DayOfWeek' = sample(1:7, 100, replace=T),
'Date' = sample(seq(as.Date('2013-01-01'), as.Date('2013-06-30'), by='day'), 100, replace=T), # sampling from a Date sequence keeps the Date class
'Sales' = sample(1000:10000, 100),
'Promo' = sample(c(0,1), 100, replace=T))
QuarterStartDate <- as.Date('2013-01-01')
QuarterEndDate <- as.Date('2013-03-31')
# Function to calculate your quarterly sales
QuarterlySales <- function(startDate, endDate, data){
# Limit between your dates, group by your variables of interest
data <- data[between(Date,startDate,endDate),list(TotalSales=sum(Sales)), by=list(Store,DayOfWeek,Promo)]
# Sort in an order that makes sense
data <- data[order(Store, DayOfWeek, Promo)]
return(data)
}
salesSummary <- QuarterlySales(QuarterStartDate, QuarterEndDate, dt)
salesSummary
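The function above returns total sales for one fixed quarter. Since the question asks for an average, here is a base R sketch of the same group-by idea with mean(), using the rows from the question's example set. The window dates are arbitrary assumptions, and this still summarises one fixed window rather than a rolling three-month window per row.

```r
# Rows copied from the example set in the question
sales <- data.frame(
  Store = 1,
  DayOfWeek = 3,
  Date = as.Date(c("2013-06-05", "2013-04-10", "2013-05-29",
                   "2013-05-08", "2013-04-03", "2013-03-27")),
  Sales = c(5012, 4903, 5784, 5230, 4625, 6660),
  Promo = c(1, 1, 1, 0, 0, 1)
)

# Window chosen for illustration only
start_date <- as.Date("2013-04-01")
end_date <- as.Date("2013-06-30")

# Filter once, then average per Store / DayOfWeek / Promo combination
in_window <- sales[sales$Date >= start_date & sales$Date <= end_date, ]
avg <- aggregate(Sales ~ Store + DayOfWeek + Promo, data = in_window, FUN = mean)
avg
```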

R: converting start/end dates into data series

I have the following data frame representing user subscriptions:
User StartDate EndDate
1 2015-09-03 2015-10-17
2 2015-10-27 2015-12-25
...
How can I transform it into a time series that gives me the count of active monthly subscriptions over time (assuming it is active in the month if at least for one day in that month). Something like this (based on the example above, assuming only 2 records):
Month Count
2015-08 0
2015-09 1
2015-10 2
2015-11 1
2015-12 1
2016-01 0
Note: I took some arbitrary start and end dates for the time series, to make the example clear.
Prepare the data and make sure that the date columns are actually stored as dates:
data <- read.table(text = "User StartDate EndDate
1 2015-09-03 2015-10-17
2 2015-10-27 2015-12-25", header = TRUE)
data$StartDate <- as.Date(data$StartDate)
data$EndDate <- as.Date(data$EndDate)
This function returns a vector with all months that are within a subscription:
library(lubridate)
subscr_month <- function(start, end) {
start <- floor_date(start, "month")
seq <- seq(start, end, by = "1 month")
months <- format(seq, format = "%Y-%m")
return(months)
}
It uses the function floor_date() from the lubridate package. It is necessary to round down the start date, because otherwise the last month might be missing. For example, for user 2, if you add two months to the start date, you end up on 2015-12-27, which is after the end date, so no date from December would be included in seq. The last line converts the dates to character strings that include only the year and month.
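The effect of rounding down can be checked directly for user 2. The sketch below uses base R only; as.Date(format(start, "%Y-%m-01")) is used here as a stand-in for floor_date(start, "month"), and unfloored and floored are illustrative names.

```r
start <- as.Date("2015-10-27")
end   <- as.Date("2015-12-25")

# Without flooring: steps land on the 27th, so December is never reached
unfloored <- format(seq(start, end, by = "1 month"), "%Y-%m")

# First day of the start month, then step month by month
start_floor <- as.Date(format(start, "%Y-%m-01"))
floored <- format(seq(start_floor, end, by = "1 month"), "%Y-%m")

unfloored
floored
```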
Now, you can apply this function to each start and end date from your data using mapply(). Afterwards, table() creates a table of counts of all dates in the resulting list:
all_month <- mapply(subscr_month, data$StartDate, data$EndDate, SIMPLIFY = FALSE)
table(unlist(all_month))
## 2015-09 2015-10 2015-11 2015-12
## 1 2 1 1
You can also convert the table to a data frame:
as.data.frame(table(unlist(all_month)))
## Var1 Freq
## 1 2015-09 1
## 2 2015-10 2
## 3 2015-11 1
## 4 2015-12 1
Your example output also includes the counts for months that do not appear in the data set. If you want to have this, you can convert the vector of months to a factor and set the levels to all the months you want to include:
month_list <- format(seq(as.Date("2015-08-01"), as.Date("2016-01-01"), by = "1 month"), format = "%Y-%m")
all_month_factor <- factor(unlist(all_month), levels = month_list)
table(all_month_factor)
## all_month_factor
## 2015-08 2015-09 2015-10 2015-11 2015-12 2016-01
## 0 1 2 1 1 0
Read in the data frame mentioned:
df = structure(list(StartDate = structure(c(16681, 16735), class = "Date"),
EndDate = structure(c(16735, 16794), class = "Date")), class = "data.frame", .Names = c("StartDate",
"EndDate"), row.names = c(NA, -2L))
You could make good use of do() from the dplyr package together with seq():
library(dplyr)
df %>%
rowwise() %>% do({
w <- seq(.$StartDate,.$EndDate,by = "15 days") #for month difference less than 1 complete month
m <- format(w,"%Y-%m") %>% unique
data.frame(Month = m)
}) %>%
group_by(Month) %>%
summarise(Count = length(Month))
