Find dynamic intervals per group with Sparklyr - r

I have a huge (~10 billion rows) data.frame that looks a bit like this :
data <- data.frame(Person = c(rep("John", 9), rep("Steve", 7), rep("Jane", 4)),
Year = c(1900:1908, 1902:1908, 1905:1908),
Grade = c(c(6,3,4,4,8,5,2,9,7), c(4,3,5,5,6,4,7), c(3,7,2,9)) )
It's a set of 3 Persons, observed at different Years, and we have their Grade for the Year in question. I would like to create a variable which, for each grade, returns "a simplified grade". The simplified grade is simply the Grade cut into different intervals.
The difficulty is that the intervals are different by Person.
To get the intervals thresholds by Person, I have the following list :
list.threshold <- list(John = c(5,7), Steve = 4, Jane = c(3,5,8))
So the grades of Steve will be cut into 2 intervals but the ones of Jane into 4 intervals.
Here are the results wanted (SimpleGrade) :
Person Year Grade SimpleGrade
1: John 1900 6 1
2: John 1901 3 0
3: John 1902 4 0
4: John 1903 4 0
5: John 1904 8 2
6: John 1905 5 1
7: John 1906 2 0
8: John 1907 9 2
9: John 1908 7 2
10: Steve 1902 4 1
11: Steve 1903 3 0
12: Steve 1904 5 1
13: Steve 1905 5 1
14: Steve 1906 6 1
15: Steve 1907 4 1
16: Steve 1908 7 1
17: Jane 1905 3 1
18: Jane 1906 7 2
19: Jane 1907 2 0
20: Jane 1908 9 3
I will have to find a solution in sparklyr because I'm working with a huge spark table.
In dplyr I would do something like this :
dplyr
data <- group_by(data, Person) %>%
mutate(SimpleGrade = cut(Grade, breaks = c(-Inf, list.threshold[[as.character(Person[1])]], Inf), labels = FALSE, right = FALSE) - 1) # right = FALSE: intervals are [low, high), as in the wanted results
It works but I'm having trouble converting this solution in sparklyr because of the fact that the thresholds are different per Person. I think I will have to use the ft_bucketizer function. Where I am so far with sparklyr :
sparklyr
spark_tbl <- group_by(spark_tbl, Person) %>%
ft_bucketizer(input_col = "Grade",
output_col = "SimpleGrade",
splits = c(-Inf, list.threshold[["John"]], Inf))
spark_tbl is only the spark table equivalent of data.
It works if I don't change the thresholds and use only the ones of John for example.
Thanks a lot, Tom C.

Spark ML Bucketizer can be used only for global operations, so it won't work for you. Instead, you can create a reference table with one row per Person and interval:
ref <- purrr::map2(names(list.threshold),
list.threshold,
function(name, brks) purrr::map2(
c("-Infinity", brks), c(brks, "Infinity"),
function(low, high) list(
name = name,
low = low,
high = high))) %>%
purrr::flatten() %>%
bind_rows() %>%
group_by(name) %>%
arrange(low, .by_group = TRUE) %>%
mutate(simple_grade = row_number() - 1) %>%
copy_to(sc, .) %>%
mutate_at(vars(one_of("low", "high")), as.numeric)
# Source: spark<?> [?? x 4]
name low high simple_grade
<chr> <dbl> <dbl> <dbl>
1 Jane -Inf 3 0
2 Jane 3 5 1
3 Jane 5 8 2
4 Jane 8 Inf 3
5 John -Inf 5 0
6 John 5 7 1
7 John 7 Inf 2
8 Steve -Inf 4 0
9 Steve 4 Inf 1
and then left_join it with the data table:
sdf <- copy_to(sc, data)
simplified <- left_join(sdf, ref, by=c("Person" = "name")) %>%
filter(Grade >= low & Grade < high) %>%
select(-low, -high)
simplified
# Source: spark<?> [?? x 4]
Person Year Grade simple_grade
<chr> <int> <dbl> <dbl>
1 John 1900 6 1
2 John 1901 3 0
3 John 1902 4 0
4 John 1903 4 0
5 John 1904 8 2
6 John 1905 5 1
7 John 1906 2 0
8 John 1907 9 2
9 John 1908 7 2
10 Steve 1902 4 1
# … with more rows
simplified %>% dbplyr::remote_query_plan()
== Physical Plan ==
*(2) Project [Person#132, Year#133, Grade#134, simple_grade#15]
+- *(2) BroadcastHashJoin [Person#132], [name#12], Inner, BuildRight, ((Grade#134 >= low#445) && (Grade#134 < high#446))
:- *(2) Filter (isnotnull(Grade#134) && isnotnull(Person#132))
: +- InMemoryTableScan [Person#132, Year#133, Grade#134], [isnotnull(Grade#134), isnotnull(Person#132)]
: +- InMemoryRelation [Person#132, Year#133, Grade#134], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- Scan ExistingRDD[Person#132,Year#133,Grade#134]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- *(1) Project [name#12, cast(low#13 as double) AS low#445, cast(high#14 as double) AS high#446, simple_grade#15]
+- *(1) Filter ((isnotnull(name#12) && isnotnull(cast(high#14 as double))) && isnotnull(cast(low#13 as double)))
+- InMemoryTableScan [high#14, low#13, name#12, simple_grade#15], [isnotnull(name#12), isnotnull(cast(high#14 as double)), isnotnull(cast(low#13 as double))]
+- InMemoryRelation [name#12, low#13, high#14, simple_grade#15], StorageLevel(disk, memory, deserialized, 1 replicas)
+- Scan ExistingRDD[name#12,low#13,high#14,simple_grade#15]
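The reference-table idea can be checked locally on the sample before running it on the Spark table. The following is a minimal base R sketch, not the sparklyr code itself: it assumes the `data` and `list.threshold` objects from the question, treats every interval as [low, high), and uses `merge()` as a stand-in for the Spark join.

```r
data <- data.frame(
  Person = c(rep("John", 9), rep("Steve", 7), rep("Jane", 4)),
  Year   = c(1900:1908, 1902:1908, 1905:1908),
  Grade  = c(6, 3, 4, 4, 8, 5, 2, 9, 7, 4, 3, 5, 5, 6, 4, 7, 3, 7, 2, 9),
  stringsAsFactors = FALSE
)
list.threshold <- list(John = c(5, 7), Steve = 4, Jane = c(3, 5, 8))

# One row per (person, interval); bounds are kept numeric so that
# sorting does not depend on how the thresholds look as strings.
ref <- do.call(rbind, lapply(names(list.threshold), function(nm) {
  brks <- sort(list.threshold[[nm]])
  data.frame(
    name         = nm,
    low          = c(-Inf, brks),
    high         = c(brks, Inf),
    simple_grade = seq_len(length(brks) + 1) - 1,
    stringsAsFactors = FALSE
  )
}))

# Join every grade to all intervals of its person, then keep the one
# interval that contains the grade (left-closed, right-open).
joined <- merge(data, ref, by.x = "Person", by.y = "name")
simplified <- joined[joined$Grade >= joined$low & joined$Grade < joined$high,
                     c("Person", "Year", "Grade", "simple_grade")]

# Restore the original ordering (John, Steve, Jane; then by Year).
simplified <- simplified[order(match(simplified$Person, names(list.threshold)),
                               simplified$Year), ]
```

On the sample this reproduces the SimpleGrade column requested in the question. The reference table stays tiny (one row per Person and interval), which is why Spark can broadcast it in the plan above.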

Related

Counting Number of People in a Hotel (R)

I am working with the R programming language. Suppose there is a hotel that has a list of customers with their check-in and check-out times (Note: the actual class of the dates is "POSIXct" and they are written as "year-month-day"):
check_in_date <- c('2010-01-01', '2010-01-02' ,'2010-01-01', '2010-01-08', '2010-01-08', '2010-01-15', '2010-01-15', '2010-01-16', '2010-01-19', '2010-01-22')
check_out_date <- c('2010-01-07', '2010-01-04' ,'2010-01-09', '2010-01-21', '2010-01-11', '2010-01-22', 'still in hotel as of today', '2010-01-20', '2010-01-25', '2010-01-29')
Person = c("John", "Smith", "Alex", "Peter", "Will", "Matt", "Tim", "Kevin", "Tom", "Adam")
hotel <- data.frame(check_in_date, check_out_date, Person )
The data looks something like this:
check_in_date check_out_date Person
1 2010-01-01 2010-01-07 John
2 2010-01-02 2010-01-04 Smith
3 2010-01-01 2010-01-09 Alex
4 2010-01-08 2010-01-21 Peter
5 2010-01-08 2010-01-11 Will
6 2010-01-15 2010-01-22 Matt
7 2010-01-15 still in hotel as of today Tim
8 2010-01-16 2010-01-20 Kevin
9 2010-01-19 2010-01-25 Tom
10 2010-01-22 2010-01-29 Adam
Question: I am trying to find out on any given day, how many people were still in the hotel. This would look something like this (just an example, does not correspond to the above data):
day_of_the_year Number_of_people_currently_in_hotel
1 2010-01-01 1
2 2010-01-02 1
3 2010-01-03 2
4 2010-01-04 0
5 2010-01-05 5
6 2010-01-06 5
7 2010-01-07 2
8 2010-01-08 2
9 2010-01-09 8
I tried to solve this problem in 3 steps:
First Step: I generated a column containing every date from the start to the end (e.g. in this example, let's suppose that there are 31 days : from the start to the end of Jan-2010)
day_of_the_year = seq(as.Date("2010/1/1"), as.Date("2010/1/31"),by="day")
Second Step: I then determined how many people checked in to the hotel at each day:
library(dplyr)
#create some indicator variable
hotel$event = 1
check_ins = hotel %>% group_by(check_in_date) %>% summarise(n = n())
check_in_date n
<chr> <int>
1 2010-01-01 2
2 2010-01-02 1
3 2010-01-08 2
4 2010-01-15 2
5 2010-01-16 1
6 2010-01-19 1
7 2010-01-22 1
Third Step: I then repeated a similar step to determine how many people checked out of the hotel each day:
check_outs = hotel %>% group_by(check_out_date) %>% summarise(n = n())
check_out_date n
<chr> <int>
1 2010-01-04 1
2 2010-01-07 1
3 2010-01-09 1
4 2010-01-11 1
5 2010-01-20 1
6 2010-01-21 1
7 2010-01-22 1
8 2010-01-25 1
9 2010-01-29 1
10 still in hotel as of today 1
Problem: Now, I am not sure how to combine the above 3 Steps in such a way so that we can find out how many people were staying at the hotel each day of the month. Can someone please show me how to do this?
Thanks!
Note: I found a "similar" question counting the number of people in the system in R , I am currently trying to see if I can adapt the methods used in this question for my problem.
I used hotel$check_in_date = as.Date(hotel$check_in_date) and hotel$check_out_date = as.Date(hotel$check_out_date) to convert the strings to dates. This function will then count the number of guests for a given date. Since you have a note in for guests that are currently checked in, I created a temporary data frame in the function to avoid overwriting the original data.
count_guests = function(date) {
temp = hotel
temp$check_out_date = ifelse(is.na(temp$check_out_date), as.Date(date), temp$check_out_date)
counts = ifelse((temp$check_in_date <= date) &(temp$check_out_date >= date), 1, 0)
return(sum(counts))
}
count_guests(as.Date("2010-01-02"))
[1] 3
count_guests(as.Date("2010-01-10"))
[1] 2
count_guests(as.Date("2010-01-21"))
[1] 4
EDIT: On second thought, it looks like you want a new data frame. This can be done easily with sapply(), which returns a plain numeric column rather than a list column.
guests = data.frame(day_of_the_year = seq(as.Date("2010/1/1"), as.Date("2010/1/31"),by="day"))
guests$num_checked_in = sapply(guests$day_of_the_year, FUN = count_guests)
day_of_the_year num_checked_in
1 2010-01-01 2
2 2010-01-02 3
3 2010-01-03 3
4 2010-01-04 3
5 2010-01-05 2
...
I think this might help, but for a complete solution we need a reference date for those who have not checked out yet.
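library(tidyverse)
library(lubridate) # provides ymd() and today(); library(tidyverse) attaches it only from tidyverse 2.0.0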
hotel %>%
mutate(
across(.cols = ends_with("_date"),.fns = ymd),
check_out_date = if_else(is.na(check_out_date), today(),check_out_date)
) %>%
mutate(
date = map2(
.x = check_in_date,
.y = check_out_date,
.f = function(x,y)seq.Date(from = x,to = y,by = "1 day"))
) %>%
unnest(date) %>%
count(date)
# A tibble: 29 x 2
date n
<date> <int>
1 2010-01-01 2
2 2010-01-02 3
3 2010-01-03 3
4 2010-01-04 3
5 2010-01-05 2
6 2010-01-06 2
7 2010-01-07 2
8 2010-01-08 3
9 2010-01-09 3
10 2010-01-10 2
# ... with 19 more rows
You can try the "lubridate" package, which is installed as part of the tidyverse. With tidyverse 2.0.0 or later, library(tidyverse) also attaches it, so you don't have to load lubridate separately.
Use ymd() to convert character to date, since year-month-day is the format of your dates.
dt <- tibble(checkin = lubridate::ymd(check_in_date),
checkout = lubridate::ymd(check_out_date),
person = Person)
For anyone that has not checked out yet, assign them checkout date of today using today() function. Or if you know the date when this data was collected that may be another sensible date to assign here.
Create interval objects with start as checkin date and end as checkout date.
Similarly create interval object for the date(s) you want to check. Here I am using 2010-01-07.
Find overlap using int_overlaps()
dt<- dt %>% mutate(
checkout = replace_na(checkout, today()),
stay_interval = lubridate::interval(start = checkin, end = checkout),
date_of_interest = lubridate::interval(ymd("2010-01-07"), ymd("2010-01-07")),
stay = lubridate::int_overlaps(date_of_interest, stay_interval)
)
dt %>% count(stay)
# A tibble: 2 x 2
stay n
<lgl> <int>
1 FALSE 8
2 TRUE 2
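The three steps from the question (a calendar, check-ins per day, check-outs per day) can also be combined directly with cumulative sums. This is a base R sketch under two assumptions that are not in the original data: the guest who is still in the hotel is treated as staying until the last day of the calendar, and a guest still counts on the check-out day, as in the answers above.

```r
check_in <- as.Date(c("2010-01-01", "2010-01-02", "2010-01-01", "2010-01-08",
                      "2010-01-08", "2010-01-15", "2010-01-15", "2010-01-16",
                      "2010-01-19", "2010-01-22"))
# NA stands for "still in hotel as of today"
check_out <- as.Date(c("2010-01-07", "2010-01-04", "2010-01-09", "2010-01-21",
                       "2010-01-11", "2010-01-22", NA, "2010-01-20",
                       "2010-01-25", "2010-01-29"))

days <- seq(as.Date("2010-01-01"), as.Date("2010-01-31"), by = "day")

# Assumption: open-ended stays run to the end of the calendar.
check_out[is.na(check_out)] <- max(days)

# Position of each event in the calendar (1 = first day).
in_idx  <- as.integer(check_in  - days[1]) + 1L
out_idx <- as.integer(check_out - days[1]) + 1L

arrivals   <- tabulate(in_idx,  nbins = length(days))
departures <- tabulate(out_idx, nbins = length(days))

# A guest is still counted on the check-out day, so a departure only
# reduces the count from the following day onwards.
occupancy <- cumsum(arrivals) - cumsum(c(0, head(departures, -1)))

guests <- data.frame(day_of_the_year = days, n_guests = occupancy)
```

Because it never compares every guest with every day, this form stays cheap when the calendar or the guest list gets long.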

Using R, how can I count objects according to multiple conditions?

I am trying to count objects in a data frame of 911 calls according to certain conditions and I am having trouble with the logic. My actual data has over 3 million rows, so I've tried to simplify my problem by considering this small subset:
dat <- structure(list(call = c("14-1234", "14-4523", "14-7711", "14-8199", "14-3124"),
badge = c("8456", "1098", "3432", "4750", "5122"),
off.sex = c("Male", "Male", "Female", "Male", "Male"),
shift = c("1", "1", "1", "1", "2"),
assignedmin = c(1902, 1870, 1950, 1899, 1907),
clearedmin = c(1980, 1910, 1990, 1912, 1956)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
The variable "call" identifies 911 calls, "badge" identifies officers, "shift" basically identifies a stretch of time in a particular area. The specific minute a call comes in is given by "assignedmin" and the call is considered cleared at the time given by "clearedmin."
I want to count how many officers on a given shift are able to respond to a particular call. For example, for call 14-1234, officer 8456 is assigned at time 1902. How many other officers would have been able to respond to that call? Officer 1098 was preoccupied with a different call from minute 1870 to minute 1910, and so would not have been able to respond to the call occurring at minute 1902. However, based on this simple data set officer 3432 would not have been busy at that time and so would be considered available. Officer 5122 was unoccupied at that time, but was on a different shift and so would not be considered available.
Desired output:
call badge off.sex shift assignedmin clearedmin n_shift n_avail n_unavail n_shift_male n_male_avail
1 14-1234 8456 Male 1 1902 1980 4 2 2 3 1
2 14-4523 1098 Male 1 1870 1910 4 4 0 3 3
3 14-7711 3432 Female 1 1950 1990 4 3 1 3 2
4 14-8199 4750 Male 1 1899 1912 4 3 1 3 2
5 14-3124 5122 Male 2 1907 1956 1 1 1 1 1
I hope this is not too convoluted. Basically, at the time given by assignedmin, an officer is available if he or she is on the same shift and not occupied with another call. I can easily count the number of officers on a shift using dplyr and data.table like so:
dat <- dat %>% group_by(shift) %>% mutate(n_shift = uniqueN(badge),
n_shift_male = uniqueN(badge[off.sex == 'Male'])) %>% ungroup()
An option using data.table to count number of officers per shift, then perform a non-equi self join to find out n_unavail and finally, n_avail = n_shift - n_unavail:
library(data.table)
setDT(dat)[, c("n_shift", "n_shift_male") := .(.N, sum(off.sex=="Male")), shift]
dat[, c("n_unavail", "n_male_not_avail") :=
dat[dat, on=.(shift, assignedmin<=assignedmin, clearedmin>=assignedmin),
by=.EACHI, .(.N - 1L, sum(x.off.sex[x.call != i.call]=="Male"))][,
(1L:3L) := NULL]
]
dat[, c("n_avail", "n_male_avail") := .(n_shift - n_unavail, n_shift_male - n_male_not_avail)]
output:
call badge off.sex shift assignedmin clearedmin n_shift n_shift_male n_unavail n_male_not_avail n_avail n_male_avail
1: 14-1234 8456 Male 1 1902 1980 4 3 2 2 2 1
2: 14-4523 1098 Male 1 1870 1910 4 3 0 0 4 3
3: 14-7711 3432 Female 1 1950 1990 4 3 1 1 3 2
4: 14-8199 4750 Male 1 1899 1912 4 3 1 1 3 2
5: 14-3124 5122 Male 2 1907 1956 1 1 0 0 1 1
The n_unavail column can be filled as below. First, I join the table to itself on shift, so that there is a row for every pair of officers in the same shift (this can be infeasible if your dataset is large). Then, I calculate whether the other officer (columns suffixed _other) is unavailable at the time of the call, and count them.
dat %>%
left_join(dat, by = "shift", suffix = c("", "_other")) %>%
mutate(unavail = (assignedmin_other < assignedmin & clearedmin_other > assignedmin)) %>%
group_by(call) %>%
summarise(n_avail = sum(!unavail),
n_unavail = sum(unavail))
# call n_avail n_unavail
# <chr> <int> <int>
# 1 14-1234 2 2
# 2 14-3124 1 0
# 3 14-4523 4 0
# 4 14-7711 3 1
# 5 14-8199 3 1
This can be joined to your table to get your desired result.
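To make the availability rule explicit, here is a small base R sketch on the sample data only. It assumes the same strict comparison as the join above (another officer is busy if their call was assigned before, and cleared after, the current call's assignedmin). It is quadratic in the number of rows, so it is meant as a readable check of the logic, not as something to run on 3 million rows.

```r
dat <- data.frame(
  call        = c("14-1234", "14-4523", "14-7711", "14-8199", "14-3124"),
  badge       = c("8456", "1098", "3432", "4750", "5122"),
  off.sex     = c("Male", "Male", "Female", "Male", "Male"),
  shift       = c("1", "1", "1", "1", "2"),
  assignedmin = c(1902, 1870, 1950, 1899, 1907),
  clearedmin  = c(1980, 1910, 1990, 1912, 1956),
  stringsAsFactors = FALSE
)

# For each call, count officers on the same shift who are busy with a
# different call at the moment this call is assigned. The strict "<"
# means a row never counts itself.
dat$n_unavail <- sapply(seq_len(nrow(dat)), function(i) {
  sum(dat$shift == dat$shift[i] &
      dat$assignedmin < dat$assignedmin[i] &
      dat$clearedmin  > dat$assignedmin[i])
})

# Officers per shift (one call per officer in this sample).
dat$n_shift <- ave(dat$assignedmin, dat$shift, FUN = length)
dat$n_avail <- dat$n_shift - dat$n_unavail
```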

For each group, for each week, find the sum of the observations in the previous X weeks in R

For each group (individual_id), for each week_id, I want to calculate the number of appearances the individual has made in the previous X weeks in each city.
I have experimented with dplyr to no avail. I have tried a loop, but it takes forever on the dataset I am using (around 250,000 observations of more than 1,000 individuals in 20 cities), especially as I want to look up the number of appearances in the previous two years (i.e. X = 104 weeks).
theDates = as.Date(c('07/05/2017','07/05/2017', '07/05/2017', '14/05/2017', '14/05/2017',
'21/05/2017','21/05/2017','21/05/2017', '28/05/2017', '04/06/2017', '04/06/2017', '04/06/2017', '11/06/2017',
'18/06/2017', '18/06/2017'), format='%d/%m/%Y')
someData = data.frame(individual_id = c(1,2,3,2,3,1,2,3,3,1,2,3,3,2,3), week_end_date=theDates,
city=c('Chicago','Chicago','Chicago','Washington', 'Washington', 'Chicago','Chicago', 'Chicago','Washington',
'Washington', 'Washington','Washington','Chicago','Washington', 'Washington'))
someData$nChicagoAppearancesInLastXweeks = NA
someData$nWashingtonAppearancesInLastXweeks = NA
X = 4 # this is the number of weeks for the window length
someData$start_of_period_date = someData$week_end_date - 7*X # this is the start of the range of dates to count appearances over
for (i in 1:dim(someData)[1]) {
WEEK_IDS = seq(someData$start_of_period_date[i], someData$week_end_date[i]-1, by='days')
INDIVIDUAL_ID = someData$individual_id[i]
someData$nChicagoAppearancesInLastXweeks[i] = sum(ifelse(someData$city=='Chicago' & someData$individual_id == INDIVIDUAL_ID & someData$week_end_date %in% WEEK_IDS,1,0))
someData$nWashingtonAppearancesInLastXweeks[i] = with(someData, sum(ifelse(city=='Washington' & individual_id == INDIVIDUAL_ID & week_end_date %in% c(WEEK_IDS),1,0)))
}
The expected output would be two new columns giving the number of times each individual_id appeared in each city in the previous X weeks. The loop code does it, but is clearly not the best way to do this.
Perform a left join for each added column:
library(sqldf)
X <- 4
sql <- "select sum(not b.city is null)
from someData a
left join someData b on
b.city == '$lev' and
a.[individual_id] = b.[individual_id] and
b.[week_end_date] between a.[week_end_date] - 7 * $X and a.[week_end_date] - 1
group by a.rowid"
for(lev in levels(factor(someData$city))) someData[lev] <- fn$sqldf(sql) # factor() so this also works when city is a character column (the default since R 4.0)
giving:
> someData
individual_id week_end_date city Chicago Washington
1 1 2017-05-07 Chicago 0 0
2 2 2017-05-07 Chicago 0 0
3 3 2017-05-07 Chicago 0 0
4 2 2017-05-14 Washington 1 0
5 3 2017-05-14 Washington 1 0
6 1 2017-05-21 Chicago 1 0
7 2 2017-05-21 Chicago 1 1
8 3 2017-05-21 Chicago 1 1
9 3 2017-05-28 Washington 2 1
10 1 2017-06-04 Washington 2 0
11 2 2017-06-04 Washington 2 1
12 3 2017-06-04 Washington 2 2
13 3 2017-06-11 Chicago 1 3
14 2 2017-06-18 Washington 1 1
15 3 2017-06-18 Washington 2 2
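The same window (from 7 * X days before the week end date up to the day before it) can be expressed without SQL by working on one individual at a time. This is a base R sketch; `count_prev` is a hypothetical helper name introduced here for illustration. It builds a small matrix of date differences per individual, so it assumes no single individual has an enormous number of rows.

```r
theDates <- as.Date(c("07/05/2017", "07/05/2017", "07/05/2017", "14/05/2017",
                      "14/05/2017", "21/05/2017", "21/05/2017", "21/05/2017",
                      "28/05/2017", "04/06/2017", "04/06/2017", "04/06/2017",
                      "11/06/2017", "18/06/2017", "18/06/2017"),
                    format = "%d/%m/%Y")
someData <- data.frame(
  individual_id = c(1, 2, 3, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 2, 3),
  week_end_date = theDates,
  city = c("Chicago", "Chicago", "Chicago", "Washington", "Washington",
           "Chicago", "Chicago", "Chicago", "Washington", "Washington",
           "Washington", "Washington", "Chicago", "Washington", "Washington"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: appearances in `city_name` during the X weeks
# before each row's week_end_date, counted per individual.
count_prev <- function(df, city_name, X) {
  out <- integer(nrow(df))
  for (rows in split(seq_len(nrow(df)), df$individual_id)) {
    d <- as.numeric(df$week_end_date[rows])
    # gap[i, j] = number of days by which appearance j precedes appearance i
    gap <- outer(d, d, "-")
    in_window <- gap >= 1 & gap <= 7 * X
    is_city <- df$city[rows] == city_name
    out[rows] <- as.integer(rowSums(in_window[, is_city, drop = FALSE]))
  }
  out
}

X <- 4
someData$Chicago    <- count_prev(someData, "Chicago", X)
someData$Washington <- count_prev(someData, "Washington", X)
```

On the sample data this gives the same two columns as the sqldf output above.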

Getting Data in a single row into multiple rows

I have code that shows which people work in certain groups. When I ask the leader of each group, in a survey, to list those who work for them, I get a single row containing all of the team members. What I need is to reshape the data into multiple rows, each with its group information.
I don't know where to start.
This is what my data frame looks like,
LeaderName <- c('John','Jane','Louis','Carl')
Group <- c('3','1','4','2')
Member1 <- c('Lucy','Stephanie','Chris','Leslie')
Member1ID <- c('1','2','3','4')
Member2 <- c('Earl','Carlos','Devon','Francis')
Member2ID <- c('5','6','7','8')
Member3 <- c('Luther','Peter','','Severus')
Member3ID <- c('9','10','','11')
GroupInfo <- data.frame(LeaderName, Group, Member1, Member1ID, Member2 ,Member2ID, Member3, Member3ID)
This is what I would like the result to look like:
LeaderName_ <- c('John','Jane','Louis','Carl','John','Jane','Louis','Carl','John','Jane','','Carl')
Group_ <- c('3','1','4','2','3','1','4','2','3','1','','2')
Member <- c('Lucy','Stephanie','Chris','Leslie','Earl','Carlos','Devon','Francis','Luther','Peter','','Severus')
MemberID <- c('1','2','3','4','5','6','7','8','9','10','','11')
ActualGroupInfor <- data.frame(LeaderName_,Group_,Member,MemberID)
An option would be melt from data.table and specify the column name patterns in the measure parameter
library(data.table)
melt(setDT(GroupInfo), measure = patterns("^Member\\d+$",
"^Member\\d+ID$"), value.name = c("Member", "MemberID"))[, variable := NULL][]
# LeaderName Group Member MemberID
# 1: John 3 Lucy 1
# 2: Jane 1 Stephanie 2
# 3: Louis 4 Chris 3
# 4: Carl 2 Leslie 4
# 5: John 3 Earl 5
# 6: Jane 1 Carlos 6
# 7: Louis 4 Devon 7
# 8: Carl 2 Francis 8
# 9: John 3 Luther 9
#10: Jane 1 Peter 10
#11: Louis 4
#12: Carl 2 Severus 11
Here is a solution in base R:
reshape(
data=GroupInfo,
idvar=c("LeaderName", "Group"),
varying=list(
Member=which(names(GroupInfo) %in% grep("^Member[0-9]$",names(GroupInfo),value=TRUE)),
MemberID=which(names(GroupInfo) %in% grep("^Member[0-9]ID",names(GroupInfo),value=TRUE))),
direction="long",
v.names = c("Member","MemberID"),
sep="_")[,-3]
#> LeaderName Group Member MemberID
#> John.3.1 John 3 Lucy 1
#> Jane.1.1 Jane 1 Stephanie 2
#> Louis.4.1 Louis 4 Chris 3
#> Carl.2.1 Carl 2 Leslie 4
#> John.3.2 John 3 Earl 5
#> Jane.1.2 Jane 1 Carlos 6
#> Louis.4.2 Louis 4 Devon 7
#> Carl.2.2 Carl 2 Francis 8
#> John.3.3 John 3 Luther 9
#> Jane.1.3 Jane 1 Peter 10
#> Louis.4.3 Louis 4
#> Carl.2.3 Carl 2 Severus 11
Created on 2019-05-23 by the reprex package (v0.2.1)
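If the number of member slots is known, the reshaping can also be written as an explicit stack of one block per slot. This is a base R sketch that assumes exactly three slots named Member1 to Member3 with matching ID columns, as in the question. Unlike the wanted output in the question, it keeps the leader and group on the row where the member is empty, and shows how to drop such rows instead.

```r
GroupInfo <- data.frame(
  LeaderName = c("John", "Jane", "Louis", "Carl"),
  Group      = c("3", "1", "4", "2"),
  Member1    = c("Lucy", "Stephanie", "Chris", "Leslie"),
  Member1ID  = c("1", "2", "3", "4"),
  Member2    = c("Earl", "Carlos", "Devon", "Francis"),
  Member2ID  = c("5", "6", "7", "8"),
  Member3    = c("Luther", "Peter", "", "Severus"),
  Member3ID  = c("9", "10", "", "11"),
  stringsAsFactors = FALSE
)

# Build one four-column block per member slot and stack the blocks.
slots <- 1:3
long <- do.call(rbind, lapply(slots, function(k) {
  data.frame(
    LeaderName = GroupInfo$LeaderName,
    Group      = GroupInfo$Group,
    Member     = GroupInfo[[paste0("Member", k)]],
    MemberID   = GroupInfo[[paste0("Member", k, "ID")]],
    stringsAsFactors = FALSE
  )
}))

# Optional: drop slots that a leader left empty.
long_filled <- long[long$Member != "", ]
```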

R finding date intervals by ID

I have the following table, whose key columns are: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in long format, so a single Customer ID can have multiple line items.
I can get the first and last dates using R DateDiff, but after converting the file to wide format using plyr I still end up with the same problem of multiple orders per customer, just with fewer rows and more columns.
Is there an R function that extends R DateDiff to work out how to get the time interval between purchases by Customer ID? That is, time between order 1 and 2, order 2 and 3, and so on assuming these orders exists.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do, since you don't give the expected result. But I guess you want to get the intervals between 2 orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
"2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs,function(x){
tmp <-diff(x$OrderDate)
tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
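For completeness, the same per-customer gap can be computed in base R on the sample from the question. This sketch keeps only the CID and Order.Date columns and adds two illustrative column names, `date` and `days_since_last`, that are not in the original table.

```r
orders <- data.frame(
  CID = c(1, 1, 1, 1, 2, 3, 4, 4, 4),
  Order.Date = c("26/02/13", "26/05/13", "28/05/13", "28/05/13", "19/08/13",
                 "3/01/13", "28/08/13", "10/09/13", "25/09/13"),
  stringsAsFactors = FALSE
)

# Parse day/month/two-digit-year strings and sort within each customer.
orders$date <- as.Date(orders$Order.Date, format = "%d/%m/%y")
orders <- orders[order(orders$CID, orders$date), ]

# Days since the same customer's previous line item; NA for the first one.
orders$days_since_last <- ave(
  as.numeric(orders$date), orders$CID,
  FUN = function(d) c(NA, diff(d))
)
```

Two line items of the same order on the same day give a gap of 0, so it may make sense to reduce the table to one row per order before taking differences.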
