I have a large data set representing paired blocks of time. I want a clean break across year boundaries, with each row starting and finishing in the same year.
As an example see the table below.
type duration cumsum year year.split
1 1 236 236 1 365
2 0 129 365 1 365
3 1 154 519 2 730
4 0 216 735 3 1095
There is no overlap between years one and two as row 3 starts on the first day of year two, however row 4 starts in year two and ends 5 days into year three. I want to split row 4 so that the table looks like the following.
type duration cumsum year year.split
1 1 236 236 1 365
2 0 129 365 1 365
3 1 0 519 1 365
4 1 154 519 2 730
5 0 211 524 2 730
6 0 5 735 3 1095
As can be seen there is no overlap across years as each overlapping block of time has been split up so each row starts and finishes in the same year. The way I have done this so far is as follows, however it seems clunky and I would hope there is a more elegant solution.
library(dplyr)
set.seed(808)
test <- data.frame(type = c(1,0), duration = round(runif(20, min = 100, max = 250))) %>%
mutate(cumsum = cumsum(duration), year = ceiling(cumsum/365), year.split = year*365 )
test <- rbind(test[1,],
filter(test, lag(year) == year),
filter(test, lag(year) != year) %>%
mutate( duration = cumsum - (year-1)*365),
filter(test, lag(year) != year) %>%
mutate( duration = ((year-1)*365 + duration- cumsum),
cumsum = cumsum-duration,
year = year -1,
year.split = year*365) ) %>% arrange(year, cumsum)
test <- group_by( test,type, year) %>%
summarise( duration = sum(duration)) %>% ungroup %>% arrange(year)
The final two lines of code summarise the data, as I am interested in the total amount of each type per year.
What is a better way of doing this?
This seems to work, assuming that the durations are all strictly positive:
cs <- test$cumsum
# seq_len() avoids adding spurious cut points when max(cs) < 365
cs0 <- sort(unique(c(cs, seq_len(floor(max(cs) / 365)) * 365)))
data.frame(type = test$type[findInterval(cs0 - 0.5, cs) + 1],
duration = diff(c(0, cs0)), cumsum = cs0, year = ceiling(cs0 / 365))
type duration cumsum year
1 1 236 236 1
2 0 129 365 1
3 1 154 519 2
4 0 211 730 2
5 0 5 735 3
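Since the question ends by summarising per type and year, the split table can be fed straight into an aggregation. The following is only a sketch in base R, assuming the four-row example from the question; the names split_df and totals are made up for illustration.

```r
# Four-row example from the question
test <- data.frame(type = c(1, 0, 1, 0), duration = c(236, 129, 154, 216))
test$cumsum <- cumsum(test$duration)

cs <- test$cumsum
# Add every 365-day boundary inside the observed range as an extra cut point
cs0 <- sort(unique(c(cs, seq_len(floor(max(cs) / 365)) * 365)))

split_df <- data.frame(
  type     = test$type[findInterval(cs0 - 0.5, cs) + 1],
  duration = diff(c(0, cs0)),
  cumsum   = cs0,
  year     = ceiling(cs0 / 365)
)

# Total duration of each type within each year
totals <- aggregate(duration ~ type + year, data = split_df, FUN = sum)
totals
```

Because the cut points include every year boundary, a block that spans several years is split at each of them, not just the first.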
I'm not sure this is the idiomatic R way you are looking for, but you can simplify your rbind call a bit:
rbind (filter(test, cumsum - duration >= (year - 1) * 365),
filter(test, cumsum - duration < (year - 1) * 365) %>%
mutate(duration = cumsum - (year - 1) * 365),
filter(test, cumsum - duration < (year - 1) * 365) %>%
mutate(year = year - 1, # I'm changing the year first so it will propagate
duration = duration - (cumsum - (year * 365)),
cumsum = (year) * 365,
year.split = year * 365)
)
As you can see, I combine three data frames:
Rows that are already correct, because the duration doesn't overlap two years
The overlapping rows, with the duration set to the number of days falling in the later year
The same rows again, with the values changed to match the previous year.
There are two things I don't like here: I used the same filter twice (for cases 2 and 3), and tomorrow I will need 10 to 15 minutes to understand this code (or I can put in a comment like # It works, don't worry).
I think that a more verbose version of this code will be easier to maintain:
# These don't overlap
ok <- filter(test, cumsum - duration >= (year - 1) * 365)
# These do overlap! We need to split them in two
ko <- filter(test, cumsum - duration < (year - 1) * 365)
# For the most recent year, it's enough to change the duration
ko.recent <- mutate(ko,
duration = cumsum - (year - 1) * 365
)
# For the previous year, a bit more
ko.previous <- mutate(ko,
year = year - 1, # I'm changing the year first
# so it will propagate
duration = duration - (cumsum - (year * 365)),
cumsum = (year) * 365,
year.split = year * 365
)
# Let me put them back together and sort them for you
test1 <- rbind (ok,
ko.recent,
ko.previous
)
Not sure if this was the answer you were looking for, I'm just learning R.
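For readers without dplyr, the same ok/ko split can be sketched in base R. This is a sketch under the assumption that the four-row example from the question is the input; start_day and overlaps are helper names introduced here. Like the code above, it assumes no block spans more than two years.

```r
# Four-row example from the question
test <- data.frame(type = c(1, 0, 1, 0), duration = c(236, 129, 154, 216))
test$cumsum     <- cumsum(test$duration)
test$year       <- ceiling(test$cumsum / 365)
test$year.split <- test$year * 365

# Day count reached before each block begins
start_day <- test$cumsum - test$duration
# A block overlaps if it begins before the first day of the year it ends in
overlaps <- start_day < (test$year - 1) * 365

ok <- test[!overlaps, ]
ko <- test[overlaps, ]

# Part of the block that falls in the later year
ko.recent <- ko
ko.recent$duration <- ko$cumsum - (ko$year - 1) * 365

# Part of the block that falls in the previous year
ko.previous <- ko
ko.previous$year       <- ko$year - 1
ko.previous$duration   <- ko$duration - (ko$cumsum - ko.previous$year * 365)
ko.previous$cumsum     <- ko.previous$year * 365
ko.previous$year.split <- ko.previous$year * 365

test1 <- rbind(ok, ko.recent, ko.previous)
test1 <- test1[order(test1$cumsum), ]

# Splitting must not change the total number of days
stopifnot(sum(test1$duration) == sum(test$duration))
```

The final check is cheap and catches off-by-one mistakes in the split arithmetic.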
I have a data set with time periods, that may overlap, showing me if somebody was present (example_df). I want to get a data set that splits a large time period (from 2014-01-01 to 2014-10-31) into smaller time periods where somebody was present (present = 1) and time periods where nobody was present (present = 0).
The result should look like result_df
Example data frame
example_df <- data.frame(ID = 1,
start = c(as.Date("2014-01-01"), as.Date("2014-03-05"), as.Date("2014-06-13"), as.Date("2014-08-15")),
end = c(as.Date("2014-04-07"), as.Date("2014-04-12"), as.Date("2014-08-05"), as.Date("2014-10-02")),
present = 1)
Result should look like this
result_df <- data.frame(ID = 1,
start = c(as.Date("2014-01-01"), as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02")),
end = c(as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02"), as.Date("2014-10-31")),
present = c(1, 0, 1, 0, 1, 0))
I have no idea how to tackle this problem, as it requires splitting time periods or adding rows (or something else?). Any help is much appreciated!
I hope I can be helpful, as I have struggled with this as well.
As in IceCreamToucan's example, this assumes independence by person ID. This approach uses dplyr to look at overlap in date ranges and then flattens them. Other examples of this approach have been described in stackoverflow and use dplyr. The end result includes time ranges where the person is present.
library(tidyr)
library(dplyr)
pres <- example_df %>%
group_by(ID) %>%
arrange(start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = min(start), end = max(end), present = 1) %>%
select(-indx)
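To see what the indx line is doing, here is a base R sketch for a single ID using the example dates. The names new_group and merged are chosen for illustration only, and the input is assumed to be sorted by start date.

```r
start <- as.Date(c("2014-01-01", "2014-03-05", "2014-06-13", "2014-08-15"))
end   <- as.Date(c("2014-04-07", "2014-04-12", "2014-08-05", "2014-10-02"))
n <- length(start)

# A new group begins whenever a start date is later than
# every end date seen so far (the running maximum of end)
new_group <- as.numeric(start[-1]) > cummax(as.numeric(end))[-n]
indx <- c(0, cumsum(new_group))

# Collapse each group to its earliest start and latest end
merged <- data.frame(
  start = as.Date(as.numeric(tapply(as.numeric(start), indx, min)),
                  origin = "1970-01-01"),
  end   = as.Date(as.numeric(tapply(as.numeric(end), indx, max)),
                  origin = "1970-01-01")
)
merged
```

The first two periods overlap, so they share an index and collapse into one row.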
Then, additional rows can be added to indicate time period when not present. In these cases, for a given ID, it will determine gaps between an older end date and a newer (more recent) start date. Then finally the result is ordered by ID and the start date.
result <- pres
for (i in unique(pres$ID)) {
pres_i <- subset(pres, ID == i)
if (nrow(pres_i) > 1) {
adding <- data.frame(ID = i, start = pres_i$end[-nrow(pres_i)]+1, end = pres_i$start[-1]-1, present = 0)
adding <- adding[adding$start <= adding$end, ]
result <- bind_rows(result, adding)
}
}
result[order(result$ID, result$start), ]
# A tibble: 5 x 4
# Groups: ID [1]
ID start end present
<dbl> <date> <date> <dbl>
1 1 2014-01-01 2014-04-12 1
2 1 2014-04-13 2014-06-12 0
3 1 2014-06-13 2014-08-05 1
4 1 2014-08-06 2014-08-14 0
5 1 2014-08-15 2014-10-02 1
Assuming you want to do it separately for each ID, you can create a data table with all dates for which someone was present, and join that with a table of all dates over that time period. The result is not exactly the same, because the present and not-present periods don't overlap.
library(data.table)
setDT(example_df)
example_df[, {
pres <- unique(unlist(Map(`:`, start, end)))
class(pres) <- 'Date'
all <- min(pres):max(pres)
class(all) <- 'Date'
pres <- data.table(day = pres)
all <- data.table(day = all)
out.full <- pres[all, on = .(day), .(day = i.day, present = +!is.na(x.day))]
out.full[, .(start = min(day), end = max(day)),
by = .(present, rid = rleid(present))][, -'rid']
}, by = ID]
# ID present start end
# 1: 1 1 2014-01-01 2014-04-12
# 2: 1 0 2014-04-13 2014-06-12
# 3: 1 1 2014-06-13 2014-08-05
# 4: 1 0 2014-08-06 2014-08-14
# 5: 1 1 2014-08-15 2014-10-02
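Both outputs above stop at the last day someone was present, whereas the question asks for coverage up to 2014-10-31. One way to include the trailing absent period is to evaluate presence day by day over the whole window and compress the result with rle(). This is a base R sketch for a single ID; the window bounds are taken from the question, and the variable names are assumptions.

```r
start <- as.Date(c("2014-01-01", "2014-03-05", "2014-06-13", "2014-08-15"))
end   <- as.Date(c("2014-04-07", "2014-04-12", "2014-08-05", "2014-10-02"))

# Every day in the period of interest
window <- seq(as.Date("2014-01-01"), as.Date("2014-10-31"), by = "day")

# TRUE for each day covered by at least one interval
covered <- outer(as.numeric(window), as.numeric(start), ">=") &
           outer(as.numeric(window), as.numeric(end), "<=")
present <- rowSums(covered) > 0

# Compress consecutive days with the same status into one row
runs  <- rle(present)
last  <- cumsum(runs$lengths)
first <- last - runs$lengths + 1

result <- data.frame(
  start   = window[first],
  end     = window[last],
  present = as.integer(runs$values)
)
result
```

As with the data.table answer, present and absent periods do not share boundary days here, so the dates differ by one day from result_df in the question.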
Please see the sample data below.
I want to convert the quarterly sale data (with a start date and end date) into monthly sale data.
For example:
Data set A-Row 1 will be split into Data set B-Rows 1, 2 and 3 for June, July and August separately, and the sale will be pro rata based on the number of days in that month; all other columns will stay the same;
Data set A-Row 2 will pick up what was left from Row 1 (which ends on 5/9/2017) and form a complete September.
Is there an efficient way to do this? The actual data is a csv file of about 100K x 15, which will be split into a new data set of approximately 300K x 15 for monthly analysis.
Some key characteristics of the sample data:
The start day of the first quarterly sales record is the day the customer joins, so it could be any day;
All sales are quarterly, covering 90, 91, or 92 days, but a quarterly record may be incomplete if the customer leaves during the quarter.
Sample Question:
Customer.ID Country Type Sale Start..Date End.Date Days
1 1 US Commercial 91 7/06/2017 5/09/2017 91
2 1 US Commerical 92 6/09/2017 6/12/2017 92
3 2 US Casual 25 10/07/2017 3/08/2017 25
4 3 UK Commercial 64 7/06/2017 9/08/2017 64
Sample Answer:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 24 7/06/2017 30/06/2017 24
2 1 US Commercial 31 1/07/2017 31/07/2017 31
3 1 US Commercial 31 1/08/2017 31/08/2017 31
4 1 US Commercial 30 1/09/2017 30/09/2017 30
5 1 US Commercial 31 1/10/2017 31/10/2017 31
6 1 US Commercial 30 1/11/2017 30/11/2017 30
7 1 US Commercial 6 1/12/2017 6/12/2017 6
8 2 US Casual 22 10/07/2017 31/07/2017 22
9 2 US Casual 3 1/08/2017 3/08/2017 3
10 3 UK Commercial 24 7/06/2017 30/06/2017 24
11 3 UK Commercial 31 1/07/2017 31/07/2017 31
12 3 UK Commercial 9 1/08/2017 9/08/2017 9
I just ran CIAndrews' code. It seems to work for the most part, but it is very slow when run on a dataset with 10,000 rows. I eventually cancelled the execution after a few minutes of waiting. There's also an issue with the number of days: For example, July has 31 days, but the days variable only shows thirty. It's true that 31-1 = 30, but the first day should be counted as well.
The code below only takes about 21 seconds on my 2015 MacBook Pro (not including data generation), and takes care of the other problem, too.
library(tidyverse)
library(lubridate)
# generate data -------------------------------------------------------------
set.seed(666)
# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = T)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = T)
type <- sample(c("commercial", "casual", "other"), 10000, replace = T)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = T)
days <- sample(85:105, 10000, replace = T)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)
# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)
# split quarters into months ----------------------------------------------
# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))
# for-loop generates new dates and adds as dataframe to list
for (i in 1:length(list_date_dfs)) {
# transfer dataframe row to variable `row`
row <- df_quarterly[i,]
# correct end date so split successful when interval doesn't cover full month
end_corr <- row$end + day(row$start) - day(row$end)
# use lubridate to compute first and last days of relevant months
m_start <- seq(row$start, end_corr, by = "month") %>%
floor_date(unit = "month")
m_end <- m_start + days_in_month(m_start) - 1
# replace first and last elements with original dates
m_start[1] <- row$start
m_end[length(m_end)] <- row$end
# compute the number of days per month as well as sales per month
# correct difference by adding 1
m_days <- as.integer(m_end - m_start) + 1
m_sale <- (row$sale / sum(m_days)) * m_days
# add tibble to list
list_date_dfs[[i]] <- tibble(customer = row$customer,
country = row$country,
type = row$type,
sale = m_sale,
start = m_start,
end = m_end,
days = m_days
)
}
# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
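The per-row logic can also be written without lubridate. Below is a base R sketch of a helper, here called split_by_month (a hypothetical name), that splits one interval at month boundaries and allocates the sale pro rata by days, counting both the first and last day. It is checked against row 1 of the sample question.

```r
split_by_month <- function(start, end, sale) {
  first_of <- function(d) as.Date(format(d, "%Y-%m-01"))
  # First day of every calendar month touched by the interval
  firsts <- seq(first_of(start), first_of(end), by = "month")
  # First day of the month following each of those months
  next_firsts <- seq(firsts[1], by = "month",
                     length.out = length(firsts) + 1)[-1]
  # Clip the first and last month to the original interval
  seg_start <- firsts
  seg_start[1] <- start
  seg_end <- next_firsts - 1
  seg_end[length(seg_end)] <- end
  # Inclusive day count, then pro rata share of the sale
  days <- as.integer(seg_end - seg_start) + 1L
  data.frame(start = seg_start, end = seg_end, days = days,
             sale = sale * days / sum(days))
}

# Row 1 of the sample question: 7/06/2017 to 5/09/2017, sale 91
monthly <- split_by_month(as.Date("2017-06-07"), as.Date("2017-09-05"), 91)
monthly
```

This handles one record at a time. Combining the partial September from one record with the next record for the same customer, as in the sample answer, would need an additional aggregation by customer and month.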
It's not pretty, as it uses several functions and loops, because the task consists of multiple operations:
# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)
# Function to split per month
library(zoo)
addrowFun <- function(y){
temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
CiAndrews,
Thanks for the help and patience. I managed to get the answer with a small change: I replaced "rbind" with "rbind.fill" from the "plyr" package, and everything ran smoothly after that.
Please see the head of sample2.csv below
customer country type sale Start Finish days
1 43108181108 US Commercial 3330 17/11/2016 24/02/2017 99
2 43108181108 US Commercial 2753 24/02/2017 23/05/2017 88
3 43108181108 US Commercial 3043 13/02/2018 18/05/2018 94
4 43108181108 US Commercial 4261 23/05/2017 18/08/2017 87
5 43103703637 UK Casual 881 4/11/2016 15/02/2017 103
6 43103703637 UK Casual 1172 26/07/2018 1/11/2018 98
Please see the code below:
library(tidyr)
# read data and convert Start and Finish to Date type
data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)
# Function to split per month
library(zoo)
library(plyr)
addrowFun <- function(y){
temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
I have the following survival dataset, and I would like to split each interval at January 1st of each year. For example, for person ID 1220, I would make the splits at 1912-01-01, 1913-01-01, 1914-01-01 and 1915-01-01. I tried to use survSplit, but it only accepts a numeric vector of cut points. Can you please let me know if there is any other way?
In the dataset below, time = EndDate - StartDate. Here is what I have so far:
test.ts <- survSplit(Surv(time, censor) ~ .,
data = test,
cut = seq(0, 1826.25, 365.25),
episode = "tgroup")
but that only splits at each year of follow-up, not at calendar year boundaries.
ID EndDate StartDate censor time status
1 1220 1915-03-01 1911-10-04 1 1244 Alive
3 4599 1906-02-15 1903-05-16 1 1006 Alive
4 6375 1899-04-10 1896-10-27 1 895 Alive
6 6386 1929-10-05 1922-01-26 0 1826 Outmigrated
7 6389 1933-12-08 1929-10-05 1 1525 Outmigrated
8 6390 1932-01-17 1927-07-24 1 1638 Dead 0-4 yrs
I'm not sure I understood what you wanted, but if you want to replicate the information in your data frame for each year in the range from StartDate to EndDate, you can do:
library(tidyverse)
library(lubridate)
df %>%
as_tibble() %>%
mutate(
RangeYear = map2(StartDate, EndDate, function(start, end) {
start <- `if`(day(start) == 1 && month(start) == 1,
year(start),
year(start) + 1)
seq(start, year(end))
})
) %>%
unnest(RangeYear)
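If what is needed is one row per episode with its own start and end date, the cut points can be built as dates directly. The following is a base R sketch; split_at_new_year is a hypothetical helper, shown for person ID 1220 from the question.

```r
split_at_new_year <- function(start, end) {
  years <- as.integer(format(start, "%Y")):as.integer(format(end, "%Y"))
  cuts <- as.Date(paste0(years, "-01-01"))
  # Keep only the January 1st dates strictly inside the interval
  cuts <- cuts[cuts > start & cuts < end]
  out <- data.frame(start = c(start, cuts), end = c(cuts, end))
  # Length of each episode in days
  out$time <- as.numeric(out$end - out$start)
  out
}

# Person ID 1220: 1911-10-04 to 1915-03-01
episodes <- split_at_new_year(as.Date("1911-10-04"), as.Date("1915-03-01"))
episodes
```

The episode lengths add up to the original time (1244 days). When carrying the event indicator over to the split rows, it would normally belong only to the last episode of each person, which is how survSplit treats numeric cut points.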
I have a data frame with the following:
1) Store
2) DayOfWeek
3) Date
4) Sales
5) Customers
6) Open
7) Promo
8) StateHoliday
9) SchoolHoliday
10) StoreType
11) Assortment
12) CompetitionDistance
13) CompetitionOpenSinceMonth
14) CompetitionOpenSinceYear
15) Promo2
16) Promo2SinceWeek
17) Promo2SinceYear
18) PromoInterval
19) CompanyDistanceBin
20) CompetitionOpenSinceDate
21) DaysSinceCompetionOpen
I am trying to calculate the Average Sales for the Previous Quarter based on the date (basically date - 3 months). But, I need to also subset based on DayOfWeek and Promo. I have written a function and am using mapply.
quarter.store.sales.func <- function(storeId, storeDate, dayofweekvar, promotion)
{
storeDate = as.Date(storeDate,"%Y-%m-%d")
EndDate = as.Date(storeDate) - 1  # day before storeDate; the original second assignment overwrote this
StartDate = ymd(storeDate + months(-3))
StartDate = as.Date(StartDate)
quarterStoresales <- subset(saleswithstore, Date >= StartDate & Date <= EndDate & Store == storeId & DayOfWeek == dayofweekvar & Promo == promotion)
quarterSales = 0
salesDf <- ddply(quarterStoresales,.(Store),summarize,avgSales=mean(Sales))
if (nrow(salesDf)>0)
quarterSales = as.numeric(round(salesDf$avgSales,digits=0))
return(quarterSales)
}
saleswithstore$QuarterSales <- mapply(quarter.store.sales.func, saleswithstore$Store, saleswithstore$Date, saleswithstore$DayOfWeek, saleswithstore$Promo)
head(exampleset)
Store DayOfWeek Date Sales Promo
186 1 3 2013-06-05 5012 1
296 1 3 2013-04-10 4903 1
337 1 3 2013-05-29 5784 1
425 1 3 2013-05-08 5230 0
449 1 3 2013-04-03 4625 0
477 1 3 2013-03-27 6660 1
saleswithstore is a data frame with 1,000,000 rows, so this solution is not workable: it performs badly and takes forever. Is there a better, more efficient way to take a specific subset of a data frame like this and then compute an average, as I am trying to do here?
I am open to any suggestions. I admittedly am new to R.
@maubin0316, your intuition in the comment is right: you can just group by the rest of the variables. I put together this example using data.table:
library(data.table)
set.seed(343)
# Create sample data
dt <- data.table('Store' = sample(1:10, 100, replace=T),
'DayOfWeek' = sample(1:7, 100, replace=T),
'Date' = sample(seq(as.Date('2013-01-01'), as.Date('2013-06-30'), by='day'), 100, replace=T), # seq() keeps the Date class; `:` would return plain numbers
'Sales' = sample(1000:10000, 100),
'Promo' = sample(c(0,1), 10, replace=T))
QuarterStartDate <- as.Date('2013-01-01')
QuarterEndDate <- as.Date('2013-03-31')
# Function to calculate your quarterly sales
QuarterlySales <- function(startDate, endDate, data){
# Limit between your dates, group by your variables of interest
data <- data[between(Date,startDate,endDate),list(TotalSales=sum(Sales)), by=list(Store,DayOfWeek,Promo)]
# Sort in an order that makes sense
data <- data[order(Store, DayOfWeek, Promo)]
return(data)
}
salesSummary <- QuarterlySales(QuarterStartDate, QuarterEndDate, dt)
salesSummary
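The question asks for an average rather than a total. As a sketch of the same grouping idea in base R, the snippet below uses the six rows from head(exampleset) and computes mean Sales per Store, DayOfWeek and Promo inside a fixed window. The window bounds are assumptions chosen for illustration; this does not reproduce the per-row rolling three-month window of the original function.

```r
# Rows shown in head(exampleset) in the question
exampleset <- data.frame(
  Store     = 1,
  DayOfWeek = 3,
  Date      = as.Date(c("2013-06-05", "2013-04-10", "2013-05-29",
                        "2013-05-08", "2013-04-03", "2013-03-27")),
  Sales     = c(5012, 4903, 5784, 5230, 4625, 6660),
  Promo     = c(1, 1, 1, 0, 0, 1)
)

# Hypothetical window, chosen only for this example
window_start <- as.Date("2013-03-01")
window_end   <- as.Date("2013-05-31")

in_window <- exampleset[exampleset$Date >= window_start &
                        exampleset$Date <= window_end, ]

# Mean sales for each combination of the grouping variables
avg_sales <- aggregate(Sales ~ Store + DayOfWeek + Promo,
                       data = in_window, FUN = mean)
avg_sales
```

Grouping once and looking results up afterwards avoids subsetting the full data frame once per row, which is what makes the mapply approach slow.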
I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, and therefore I need to deal with cases like ID 3, whose stay at the hospital goes over the end of the year, and ID 4, whose stay at the hospital is longer than one year. There is also the problem that some people do have a record in the next year, and I would like to add the 'surplus' days to those records when this happens.
So far I have come up with this solution:
library(lubridate)
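# Last day of the admission year
year_end <- as.Date(paste(year(data$date), "12-31", sep = "-"), format = "%Y-%m-%d")
# Days from admission to the end of that year
days_left <- as.numeric(year_end - data$date)
# Cap the stay at the end of the admission year
ndays_new <- ifelse(days_left < data$ndays, days_left, data$ndays)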
However, I can't think of a way to get those 'surplus' days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact. But, I tried to employ dplyr and did the following. I initially changed column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, that means you do not have to consider the following year. If the years do not match, you need to consider the following year. ndays.2 is basically ndays for the following year. Then, I reshaped the data using do. After filtering unnecessary rows with NAs, I changed date to year and aggregated the data by ID and year.
library(dplyr)
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
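A different way to get the same table, which also copes with stays longer than two years, is to expand each stay into one row per day and count days per ID and year. This is a base R sketch, assuming the admission day counts as day 1 of the stay; per_day and result are names made up here. It trades memory for simplicity, so it may not suit very large data sets.

```r
mydf <- data.frame(
  ID    = c(1, 2, 3, 4, 4),
  date  = as.Date(c("2005-06-01", "2005-06-15", "2005-12-25",
                    "2005-01-01", "2006-06-04")),
  ndays = c(15, 60, 20, 400, 15)
)

# One row per hospital day, tagged with the calendar year it falls in
per_day <- do.call(rbind, lapply(seq_len(nrow(mydf)), function(i) {
  days <- seq(mydf$date[i], by = "day", length.out = mydf$ndays[i])
  data.frame(ID = mydf$ID[i], year = as.integer(format(days, "%Y")))
}))
per_day$ndays <- 1

# Count days per ID and year; surplus days land in the following year
result <- aggregate(ndays ~ ID + year, data = per_day, FUN = sum)
result <- result[order(result$ID, result$year), ]
result
```

For the sample data this gives the same six rows as the output above, including 365 days in 2005 and 50 days in 2006 for ID 4.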