Counting Number of People in a Hotel (R) - r

I am working with the R programming language. Suppose there is a hotel that has a list of customers with their check-in and check-out dates (Note: the actual values of the dates are "POSIXct" and are written as "year-month-day"):
check_in_date <- c('2010-01-01', '2010-01-02' ,'2010-01-01', '2010-01-08', '2010-01-08', '2010-01-15', '2010-01-15', '2010-01-16', '2010-01-19', '2010-01-22')
check_out_date <- c('2010-01-07', '2010-01-04' ,'2010-01-09', '2010-01-21', '2010-01-11', '2010-01-22', 'still in hotel as of today', '2010-01-20', '2010-01-25', '2010-01-29')
Person = c("John", "Smith", "Alex", "Peter", "Will", "Matt", "Tim", "Kevin", "Tom", "Adam")
hotel <- data.frame(check_in_date, check_out_date, Person )
The data looks something like this:
check_in_date check_out_date Person
1 2010-01-01 2010-01-07 John
2 2010-01-02 2010-01-04 Smith
3 2010-01-01 2010-01-09 Alex
4 2010-01-08 2010-01-21 Peter
5 2010-01-08 2010-01-11 Will
6 2010-01-15 2010-01-22 Matt
7 2010-01-15 still in hotel as of today Tim
8 2010-01-16 2010-01-20 Kevin
9 2010-01-19 2010-01-25 Tom
10 2010-01-22 2010-01-29 Adam
Question: I am trying to find out on any given day, how many people were still in the hotel. This would look something like this (just an example, does not correspond to the above data):
day_of_the_year Number_of_people_currently_in_hotel
1 2010-01-01 1
2 2010-01-02 1
3 2010-01-03 2
4 2010-01-04 0
5 2010-01-05 5
6 2010-01-06 5
7 2010-01-07 2
8 2010-01-08 2
9 2010-01-09 8
I tried to solve this problem in 3 steps:
First Step: I generated a column containing every date from the start to the end (e.g. in this example, let's suppose that there are 31 days : from the start to the end of Jan-2010)
day_of_the_year = seq(as.Date("2010/1/1"), as.Date("2010/1/31"),by="day")
Second Step: I then determined how many people checked in to the hotel at each day:
library(dplyr)
#create some indicator variable
hotel$event = 1
check_ins = hotel %>% group_by(check_in_date) %>% summarise(n = n())
check_in_date n
<chr> <int>
1 2010-01-01 2
2 2010-01-02 1
3 2010-01-08 2
4 2010-01-15 2
5 2010-01-16 1
6 2010-01-19 1
7 2010-01-22 1
Third Step: I then repeated a similar step to determine how many people checked out of the hotel each day:
check_outs = hotel %>% group_by(check_out_date) %>% summarise(n = n())
check_out_date n
<chr> <int>
1 2010-01-04 1
2 2010-01-07 1
3 2010-01-09 1
4 2010-01-11 1
5 2010-01-20 1
6 2010-01-21 1
7 2010-01-22 1
8 2010-01-25 1
9 2010-01-29 1
10 still in hotel as of today 1
Problem: Now, I am not sure how to combine the above 3 Steps in such a way so that we can find out how many people were staying at the hotel each day of the month. Can someone please show me how to do this?
Thanks!
Note: I found a "similar" question counting the number of people in the system in R , I am currently trying to see if I can adapt the methods used in this question for my problem.

I used hotel$check_in_date = as.Date(hotel$check_in_date) and hotel$check_out_date = as.Date(hotel$check_out_date) to convert the strings to dates. The function below then counts the number of guests for a given date. Since you have a note instead of a date for the guest who is currently checked in (as.Date() turns it into NA), I created a temporary data frame in the function to avoid overwriting the original data.
count_guests = function(date) {
temp = hotel
temp$check_out_date = ifelse(is.na(temp$check_out_date), as.Date(date), temp$check_out_date)
counts = ifelse((temp$check_in_date <= date) &(temp$check_out_date >= date), 1, 0)
return(sum(counts))
}
count_guests(as.Date("2010-01-02"))
[1] 3
count_guests(as.Date("2010-01-10"))
[1] 2
count_guests(as.Date("2010-01-21"))
[1] 4
EDIT: On second thought it looks like you want a new data frame. This can be done easily with sapply(), which returns a plain numeric column rather than the list column that lapply() would produce.
guests = data.frame(day_of_the_year = seq(as.Date("2010/1/1"), as.Date("2010/1/31"),by="day"))
guests$num_checked_in = sapply(guests$day_of_the_year, FUN = count_guests)
day_of_the_year num_checked_in
1 2010-01-01 2
2 2010-01-02 3
3 2010-01-03 3
4 2010-01-04 3
5 2010-01-05 2
...
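The check-in and check-out tallies from the question can also be combined directly with cumulative sums. The sketch below is one way to do that in base R; it assumes, as the function above does, that a guest still counts as present on the check-out day, and it stores the guest with no check-out date as NA. The names arrivals, departures and occupancy are illustrative only.

```r
check_in <- as.Date(c("2010-01-01", "2010-01-02", "2010-01-01", "2010-01-08", "2010-01-08",
                      "2010-01-15", "2010-01-15", "2010-01-16", "2010-01-19", "2010-01-22"))
# NA marks the guest who is still in the hotel
check_out <- as.Date(c("2010-01-07", "2010-01-04", "2010-01-09", "2010-01-21", "2010-01-11",
                       "2010-01-22", NA, "2010-01-20", "2010-01-25", "2010-01-29"))

# Step 1: every day of the month
days <- seq(as.Date("2010-01-01"), as.Date("2010-01-31"), by = "day")

# Steps 2 and 3: arrivals and departures per day, with zeros on days without events
arrivals   <- as.integer(table(factor(as.character(check_in),  levels = as.character(days))))
departures <- as.integer(table(factor(as.character(check_out), levels = as.character(days))))

# Guests present on a day = all arrivals so far minus departures on earlier days
occupancy <- cumsum(arrivals) - c(0L, head(cumsum(departures), -1))

guests <- data.frame(day_of_the_year = days,
                     Number_of_people_currently_in_hotel = occupancy)
head(guests)
```

If a guest should not be counted on the check-out day, subtracting cumsum(departures) without the one-day shift would be the corresponding variant.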

I think this might help, but for a complete solution we need a reference date for those who have not checked out yet.
library(tidyverse)
hotel %>%
mutate(
across(.cols = ends_with("_date"),.fns = ymd),
check_out_date = if_else(is.na(check_out_date), today(),check_out_date)
) %>%
mutate(
date = map2(
.x = check_in_date,
.y = check_out_date,
.f = function(x,y)seq.Date(from = x,to = y,by = "1 day"))
) %>%
unnest(date) %>%
count(date)
# A tibble: 29 x 2
date n
<date> <int>
1 2010-01-01 2
2 2010-01-02 3
3 2010-01-03 3
4 2010-01-04 3
5 2010-01-05 2
6 2010-01-06 2
7 2010-01-07 2
8 2010-01-08 3
9 2010-01-09 3
10 2010-01-10 2
# ... with 19 more rows
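The same expand-and-count idea can be sketched in base R with a fixed reference date instead of today(). This is only an illustration: reference_date is a hypothetical collection date (2010-01-31) chosen for the example, and the variable names are not from the original answer.

```r
check_in <- as.Date(c("2010-01-01", "2010-01-02", "2010-01-01", "2010-01-08", "2010-01-08",
                      "2010-01-15", "2010-01-15", "2010-01-16", "2010-01-19", "2010-01-22"))
check_out <- as.Date(c("2010-01-07", "2010-01-04", "2010-01-09", "2010-01-21", "2010-01-11",
                       "2010-01-22", NA, "2010-01-20", "2010-01-25", "2010-01-29"))

# Assumption: the data were collected on 2010-01-31; substitute the real date
reference_date <- as.Date("2010-01-31")
check_out[is.na(check_out)] <- reference_date

# One vector of days per stay, then all stays concatenated into one Date vector
stay_days <- lapply(seq_along(check_in),
                    function(i) seq(check_in[i], check_out[i], by = "day"))
all_days <- do.call(c, stay_days)

# Count how many stays cover each day
counts <- as.data.frame(table(date = all_days), responseName = "n")
counts$date <- as.Date(as.character(counts$date))
head(counts)
```

Days on which nobody stayed do not appear in counts; merging against a full sequence of days would be needed to show them as zero.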

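You can try the lubridate package, which I believe is part of the tidyverse, so if you load tidyverse you don't have to load lubridate separately (it is attached by library(tidyverse) from tidyverse 2.0.0 onward; with older versions, load it explicitly).
Use ymd() to convert character to date, since year-month-day is the format of your dates.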
dt <- tibble(checkin = lubridate::ymd(check_in_date),
checkout = lubridate::ymd(check_out_date),
person = Person)
For anyone that has not checked out yet, assign them checkout date of today using today() function. Or if you know the date when this data was collected that may be another sensible date to assign here.
Create interval objects with start as checkin date and end as checkout date.
Similarly create interval object for the date(s) you want to check. Here I am using 2010-01-07.
Find the overlap using int_overlaps().
dt<- dt %>% mutate(
checkout = replace_na(checkout, today()),
stay_interval = lubridate::interval(start = checkin, end = checkout),
date_of_interest = lubridate::interval(ymd("2010-01-07"), ymd("2010-01-07")),
stay = lubridate::int_overlaps(date_of_interest, stay_interval)
)
dt %>% count(stay)
# A tibble: 2 x 2
stay n
<lgl> <int>
1 FALSE 8
2 TRUE 2

Related

time differences for multiple events for same ID in R

I'm new to Stack Overflow and looked at similar posts, but couldn't find a solution that can capture time differences between multiple events for the same ID.
What I've got:
Time<-c('2016-10-04','2016-10-18', '2016-10-04','2016-10-18','2016-10-19','2016-10-28','2016-10-04','2016-10-19','2016-10-21','2016-10-22', '2017-01-02', '2017-03-04')
Value<-c(0,1,0,1,0,0,0,1,0,1,1,0)
StoreID<-c('a','a','b','b','c','c','d','d','a','a','d','c')
Unit<-c(1,1,2,2,5,5,6,6,1,1,6,5)
Helper<-c('a1','a1','b2','b2','c5','c5','d6','d6','a1','a1','d6','c5')
The helper column is the StoreID and Unit combined because I couldn't figure out how to group by both Store ID and the Unit. I want to sort the data to show when the unit was disabled (value =0) and enabled again (value =1).
Ultimately, I'd want:
Store_ID Unit Helper Time(v=0) Time(v=1) Time2(v=0) Time 2(v=1)
a 1 a1 2016-10-04 2016-10-18 2016-10-21 2016-10-22
b 2 b2 2016-10-04 2016-10-18
c 5 c5 2016-10-19 2016-10-28 2017-03-04
d 6 d6 2016-10-04 2016-10-19
Any thoughts?
I'm thinking something in dplyr but am stumped about where to go further.
Create a Headers column that combines the Value column with a row number that distinguishes repeated events within each group, then spread to wide format.
I didn't use the helper column; I grouped by StoreID and Unit instead.
library(dplyr)
library(tidyr)
df <- data.frame(StoreID, Unit, Time, Value)
df %>%
group_by(StoreID, Unit, Value) %>%
mutate(Headers = sprintf('Time %s (v=%s)', row_number(), Value)) %>%
ungroup() %>% select(-Value) %>%
spread(Headers, Time)
# A tibble: 4 x 7
# StoreID Unit `Time 1 (v=0)` `Time 1 (v=1)` `Time 2 (v=0)` `Time 2 (v=1)` `Time 3 (v=0)`
#* <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 a 1 2016-10-04 2016-10-18 2016-10-21 2016-10-22 NA
#2 b 2 2016-10-04 2016-10-18 NA NA NA
#3 c 5 2016-10-19 NA 2016-10-28 NA 2017-03-04
#4 d 6 2016-10-04 2016-10-19 NA 2017-01-02 NA
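For readers without the tidyverse, the same numbering-then-widening idea can be sketched in base R with ave() and reshape(). This is a rough equivalent rather than the answer's own method; occurrence and wide are names introduced here, and reshape() prefixes the new columns with "Time.".

```r
Time <- c("2016-10-04", "2016-10-18", "2016-10-04", "2016-10-18", "2016-10-19", "2016-10-28",
          "2016-10-04", "2016-10-19", "2016-10-21", "2016-10-22", "2017-01-02", "2017-03-04")
Value <- c(0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0)
StoreID <- c("a", "a", "b", "b", "c", "c", "d", "d", "a", "a", "d", "c")
Unit <- c(1, 1, 2, 2, 5, 5, 6, 6, 1, 1, 6, 5)
df <- data.frame(StoreID, Unit, Time, Value, stringsAsFactors = FALSE)

# Number repeated events within each StoreID/Unit/Value combination (in row order)
occurrence <- ave(seq_along(df$Value), df$StoreID, df$Unit, df$Value, FUN = seq_along)
df$Headers <- sprintf("Time %d (v=%d)", occurrence, as.integer(df$Value))

# One row per StoreID/Unit, one column per header
wide <- reshape(df[, c("StoreID", "Unit", "Headers", "Time")],
                idvar = c("StoreID", "Unit"), timevar = "Headers",
                direction = "wide")
wide
```

The numbering follows row order, so the data should be sorted by Time within each store and unit before running it.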

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1%>%
group_by(id) %>%
arrange(id, test_date) %>%
filter(n() >= 2) %>% #To only get info on students who have taken the exam more than once and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous event
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
summarise(retest = n() > 1 && test_date[n()] < result_date[1])
The n() > 1 check keeps students who took the test only once at FALSE; without it, their single test date would be compared against its own result date and always come out TRUE. If a date is missing, the comparison can still return NA. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
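The within-student comparison that the answers above rely on can also be sketched in base R with ave(), which applies a function separately to each id so that students are never mixed up. This is a minimal sketch, assuming each test should be compared with the result date of the same student's immediately preceding test; prev_result is a name introduced here.

```r
data1 <- data.frame(
  id = c(1, 1, 3, 4, 5, 6, 7, 7, 8, 9, 9),
  test_date = as.Date(c("2012-06-27", "2012-07-10", "2013-07-04", "2012-03-24", "2012-07-22",
                        "2013-09-16", "2012-06-21", "2013-10-18", "2013-04-21", "2012-02-16",
                        "2012-03-15")),
  result_date = as.Date(c("2012-07-29", "2012-09-02", "2013-08-01", "2012-04-25", "2012-09-01",
                          "2013-10-20", "2012-07-01", "2013-10-31", "2013-05-17", "2012-03-17",
                          "2012-04-20"))
)

# Sort so that each student's tests are in chronological order
data1 <- data1[order(data1$id, data1$test_date), ]

# Result date of the previous test by the same student (NA for a first test)
prev_result <- ave(as.numeric(data1$result_date), data1$id,
                   FUN = function(x) c(NA, head(x, -1)))

# 1 if the test was taken before the previous result was released, else 0
data1$re_test <- as.integer(!is.na(prev_result) &
                              as.numeric(data1$test_date) < prev_result)
data1
```

Note that this flags the retake itself (the second row for ids 1 and 9), whereas the desired output in the question flags the earlier attempt; shifting the comparison the other way would give that variant.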

Identify and remove duplicates by a criteria in R

Hi I am puzzled with a problem concerning duplicates in R. I have looked around a lot and don't seem to find any help. I have a dataset like that
x = data.frame( id = c("A","A","A","A","A","A","A","B","B","B","B"),
StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
"08/10/2006", "09/04/2007", "02/03/2011","05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"),
EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006", "07/05/2007", "30/03/2011",
"02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"),
Group = c(1,1,1,2,2,3,4,2,3,4,4),
TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006", "NA", "02/03/2011",
"NA", "07/09/2009", "07/09/2009", "08/10/2009"),
Code = c(4,4,4858,4,4858,NA,4,NA, 795, 795, 4)
)
> x
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
2 A 09/07/2006 06/08/2006 1 08/09/2006 4
3 A 09/07/2006 06/08/2006 1 08/10/2006 4858
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
9 B 08/06/2009 06/07/2009 3 07/09/2009 795
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
So basically what I am trying to do is to identify duplicates in the TestDate variable by ID. For example dates 08/09/2006 and 08/10/2006 seem to be repeated in the same person but for different Group and I don't want the same Testdate to be in different Group by ID. The criteria to choose which TestDate to choose is to take the difference in days of TestDate with StartDate and EndDate for the different groups and then keep the one with the smallest difference in days. For example, about the date 08/10/2006 I would like to keep row 5 as the TestDate there is closer to the StartDate, than compared with the same differences in row 3. Eventually, I would like to get with a dataset like that
> xfinal
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
Any help on that will be much appreciated. Thanks
x$StartDate <- as.Date(x$StartDate,format="%d/%m/%Y")
x$EndDate <- as.Date(x$EndDate,format="%d/%m/%Y")
x$TestDate <- as.Date(x$TestDate,format="%d/%m/%Y")
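# distance in days between each test date and the start of the group's period
x$Diff <- abs(as.numeric(difftime(x$TestDate, x$StartDate, units = "days")))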
x <- x[order(x$id,x$Diff),]
x <- x[!duplicated(x[,c("id","TestDate")]),]
x$Diff <- NULL
x

Creating a vector containing total quantities sold per delivery term

Have a look at the simplified table below. For each product, I want a vector containing the quantities sold within each delivery term. A delivery term is defined as 4 days. So if we look at product A, we see that it starts at 03/12/15 and within the first delivery term (until 07/12/15) it has sold a quantity of 4. The second delivery term starts at 08/12/15 and ends at 12/12/15. So for this period there is 1 quantity sold. The following delivery term starts at 13/12/15 and ends at 17/12/15. During this period there are no quantities sold, and thus for this period the vector must have a value of 0. In the last period, finally, 2 products are sold. So basically the problem here is that information regarding the periods where no products are sold is missing.
Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable to a real dataset containing over 1000 product categories, so it has to be 'automatized' in one way.
I would be very grateful if somebody could point me in the right direction.
Product Quantity Date
A 1 03/12/15
A 2 04/12/15
A 1 05/12/15
A 1 08/12/15
A 1 17/12/16
A 1 18/12/16
B 1 19/12/15
B 2 10/05/15
B 2 11/05/15
C 1 01/06/15
C 1 02/06/15
C 1 12/06/15
Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).
library(lubridate)
library(dplyr)
# create date time columns
dt$Date = dmy(dt$Date)
dt %>%
group_by(Product) %>%
do(data.frame(days = seq(min(.$Date), max(.$Date), by="1 day"))) %>% # create all combinations between product and days
mutate(dist = as.numeric(difftime(days,min(days), units="days"))) %>% # create distance of each day with min date
ungroup() %>%
left_join(dt, by=c("Product"="Product","days"="Date")) %>% # join info to get quantities for each day
mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity), # replace NAs with 0s
id = floor(dist/5 + 1)) %>% # create the period id; each period covers 5 calendar days (e.g. 03/12 to 07/12)
group_by(Product, id) %>%
summarise(Sum = sum(Quantity),
min_date = min(days),
max_date = max(days)) %>%
ungroup
# Product id Sum min_date max_date
# 1 A 1 4 2015-12-03 2015-12-07
# 2 A 2 1 2015-12-08 2015-12-12
# 3 A 3 0 2015-12-13 2015-12-17
# 4 A 4 0 2015-12-18 2015-12-22
# 5 A 5 0 2015-12-23 2015-12-27
# 6 A 6 0 2015-12-28 2016-01-01
# 7 A 7 0 2016-01-02 2016-01-06
# 8 A 8 0 2016-01-07 2016-01-11
# 9 A 9 0 2016-01-12 2016-01-16
# 10 A 10 0 2016-01-17 2016-01-21
# .. ... .. ... ... ...
The first row of the output tells you that for product A, in the first period (id = 1, running from 3/12 to 7/12), a total quantity of 4 was sold.
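The zero-filling step can also be sketched in base R for a single product. The helper period_totals() and the small example data below are hypothetical, chosen to reproduce the 4, 1, 0, 2 pattern described in the question; period_days = 5 is an assumption that matches the 03/12 to 07/12 example period.

```r
# Sum quantities per delivery period, keeping periods without sales as 0
period_totals <- function(dates, quantity, period_days = 5) {
  # Period index: whole periods elapsed since the first sale, starting at 1
  period <- as.integer(dates - min(dates)) %/% period_days + 1L
  n_periods <- max(period)
  # A factor with every period as a level makes empty periods show up (as NA)
  totals <- tapply(quantity, factor(period, levels = seq_len(n_periods)), sum)
  totals[is.na(totals)] <- 0
  as.vector(totals)
}

# Hypothetical sales of one product
dates <- as.Date(c("03/12/15", "04/12/15", "05/12/15", "08/12/15", "19/12/15", "20/12/15"),
                 format = "%d/%m/%y")
quantity <- c(1, 2, 1, 1, 1, 1)

period_totals(dates, quantity)
```

For many products, one option would be to split the data by product and call the helper on each piece, for example with lapply(split(df, df$Product), function(d) period_totals(d$Date, d$Quantity)).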
I would suggest {dplyr}'s summarise(),mutate() and group_by() functions. group_by() groups your data by desired variables (in your case - product and delivery term),mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).
So this is how it will look:
convert date into proper format:
library(dplyr)
df=as_tibble(df) # tbl_df() is deprecated in current dplyr
df$Date=as.Date(df$Date,format="%d/%m/%y")
calculating delivery terms
df=group_by(df,Product) %>% arrange(Date)
df=mutate(df,term=1+unclass((Date-min(Date)))%/%4)
group by product and terms and calculate sum of quantity:
df=group_by(df,Product,term)
summarise(df,sum=sum(Quantity))
Here's a base R way:
df$groups <- ave(as.numeric(df$Date), df$Product, FUN=function(x) {
intrvl <- findInterval(x, seq(min(x), max(x),4))
as.numeric(factor(intrvl))
})
df
# Product Quantity Date groups
# 1 A 1 2015-12-03 1
# 2 A 2 2015-12-04 1
# 3 A 1 2015-12-05 1
# 4 A 1 2015-12-08 2
# 5 A 1 2016-12-17 3
# 6 A 1 2016-12-18 3
# 7 B 1 2015-12-19 2
# 8 B 2 2015-05-10 1
# 9 B 2 2015-05-11 1
# 10 C 1 2015-06-01 1
# 11 C 1 2015-06-02 1
# 12 C 1 2015-06-12 2
The dates should be converted to one of the date classes. I chose as.Date. When it converts to numeric, the output will be the number of days from a specified date. From there, we are able to group by 4 day increments.
Data
df$Date <- as.Date(df$Date, format="%d/%m/%y")

R finding date intervals by ID

Having the following table which comprises some key columns which are: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG Format, in that you will get multi line items for the 1 Customer ID.
I can get the first date last date using R DateDiff but converting the file to WIDE format using Plyr, still end up with the same problem of getting multiple orders by customer, just less rows and more columns.
Is there an R function that extends R DateDiff to work out how to get the time interval between purchases by Customer ID? That is, time between order 1 and 2, order 2 and 3, and so on assuming these orders exists.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do since you don't give the expected result, but I guess you want the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
"2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs,function(x){
tmp <-diff(x$OrderDate)
tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
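For comparison, the days since the previous order can be sketched without any packages by combining ave() with diff(). This is a minimal sketch on the same toy data; it assumes the orders are sorted by date within each customer, and the column name SinceLast is simply reused from the example above.

```r
df <- data.frame(
  customerID = c(rep("A", 3), rep("B", 4)),
  OrderDate = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03", "2013-06-01",
                        "2013-06-02", "2013-06-03", "2013-07-01"))
)

# Make sure each customer's orders are in chronological order
df <- df[order(df$customerID, df$OrderDate), ]

# Days since the same customer's previous order (NA for the first order)
df$SinceLast <- ave(as.numeric(df$OrderDate), df$customerID,
                    FUN = function(x) c(NA, diff(x)))
df
```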

Resources