Consider this dataset:
mydf <- data.frame(churn_indicator = c(0,0,1,0,1),
resign_date = c(NA,NA,"2011-01-01",NA,"2012-02-01"),
join_date = c("2001-01-01","2001-03-01","2002-04-02",
"2003-09-01","2005-05-10"))
The task is to calculate a vector 'length' which is resign_date - join_date where churn_indicator is 1, and Sys.Date() - join_date where churn_indicator is 0.
I have already figured out how to do this using a for loop, but I want to use something more efficient (the apply family, maybe). Also, is it possible to do this using dplyr's mutate function?
A possible solution:
# convert column from factor/characters to Date (if not already done)
mydf$resign_date <- as.Date(mydf$resign_date)
mydf$join_date <- as.Date(mydf$join_date)
# compute the date differences
days_churn1 <- as.numeric(difftime(mydf$resign_date,mydf$join_date,units='days'))
days_churn0 <- as.numeric(difftime(Sys.Date(),mydf$join_date,units='days'))
# set to zero the values where churn indicator is not what we want
days_churn1[mydf$churn_indicator==0]<-0
days_churn0[mydf$churn_indicator==1]<-0
# sum the two vectors
mydf$length <- days_churn1+days_churn0
> mydf
churn_indicator resign_date join_date length
1 0 <NA> 2001-01-01 5997
2 0 <NA> 2001-03-01 5938
3 1 2011-01-01 2002-04-02 3196
4 0 <NA> 2003-09-01 5024
5 1 2012-02-01 2005-05-10 2458
Alternatively, you can combine some operations using ifelse:
# convert column from factor/characters to Date (if not already done)
mydf$resign_date <- as.Date(mydf$resign_date)
mydf$join_date <- as.Date(mydf$join_date)
mydf$length <-
as.numeric(
ifelse(mydf$churn_indicator==1,
difftime(mydf$resign_date,mydf$join_date,units='days'),
difftime(Sys.Date(),mydf$join_date,units='days')
))
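To answer the mutate part of the question: the same logic drops straight into dplyr. A minimal sketch, assuming the date columns have already been converted with as.Date as above:
library(dplyr)

mydf <- mydf %>%
  mutate(length = as.numeric(ifelse(
    churn_indicator == 1,
    difftime(resign_date, join_date, units = "days"),
    difftime(Sys.Date(), join_date, units = "days")
  )))
The as.numeric wrapper is still needed because ifelse strips the difftime attributes from the result.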
I am working in R and trying to understand the best way to join data frames when one of them is very large.
I have a data frame which is not excruciatingly large but also not small (~80K observations of 8 variables, 144 MB). I need to match observations from this data frame to observations from another smaller data frame on the basis of a date range. Specifically, I have:
events.df <- data.frame(individual=c('A','B','C','A','B','C'),
event=c(1,1,1,2,2,2),
time=as.POSIXct(c('2014-01-01 08:00:00','2014-01-05 13:00:00','2014-01-10 07:00:00','2014-05-01 01:00:00','2014-06-01 12:00:00','2014-08-01 10:00:00'),format="%Y-%m-%d %H:%M:%S"))
trips.df <- data.frame(individual=c('A','B','C'),trip=c('x1A','CA1B','XX78'),
trip_start = as.POSIXct(c('2014-01-01 06:00:00','2014-01-04 03:00:00','2014-01-08 12:00:00'),format="%Y-%m-%d %H:%M:%S"),
trip_end=as.POSIXct(c('2014-01-03 06:00:00','2014-01-06 03:00:00','2014-01-11 12:00:00'),format="%Y-%m-%d %H:%M:%S"))
In my case events.df contains around 80,000 unique events and I am looking to match them to events from the trips.df data frame, which has around 200 unique trips. Each trip has a unique trip identifier ('trip'). I would like to match based on whether the event took place during the date range defining a trip.
First, I tried fuzzy_inner_join from the fuzzyjoin library. It works great in principle:
fuzzy_inner_join(events.df, trips.df,
                 by = c('individual'='individual', 'time'='trip_start', 'time'='trip_end'),
                 match_fun = list(`==`, `>=`, `<=`))
individual.x event time individual.y trip trip_start trip_end
1 A 1 2014-01-01 08:00:00 A x1A 2014-01-01 06:00:00 2014-01-03 06:00:00
2 B 1 2014-01-05 13:00:00 B CA1B 2014-01-04 03:00:00 2014-01-06 03:00:00
3 C 1 2014-01-10 07:00:00 C XX78 2014-01-08 12:00:00 2014-01-11 12:00:00
but runs out of memory when I try to apply it to the larger data frames.
Here is a second solution I cobbled together:
trip.match <- function(tripid){
  individual <- trips.df$individual[trips.df$trip==tripid]
  start <- trips.df$trip_start[trips.df$trip==tripid]
  end <- trips.df$trip_end[trips.df$trip==tripid]
  tmp <- events.df[events.df$individual==individual &
                   events.df$time >= start &
                   events.df$time <= end, ]
  tmp$trip <- tripid
  return(tmp)
}
result <- data.frame(rbindlist(lapply(unique(trips.df$trip), trip.match)))
This solution also breaks down because the list object returned by lapply is 25GB and the attempt to cast this list to a data frame also exhausts the available memory.
I have been able to do what I need to do using a for loop. Basically, I append a column onto events.df and loop through the unique trip identifiers and populate the new column in events.df accordingly:
events.df$trip <- NA
for(i in unique(trips.df$trip)){
individual <- trips.df$individual[trips.df$trip==i]
start <- min(trips.df$trip_start[trips.df$trip==i])
end <- max(trips.df$trip_end[trips.df$trip==i])
events.df$trip[events.df$individual==individual & events.df$time >= start & events.df$time <= end] <- i
}
> events.df
individual event time trip
1 A 1 2014-01-01 08:00:00 x1A
2 B 1 2014-01-05 13:00:00 CA1B
3 C 1 2014-01-10 07:00:00 XX78
4 A 2 2014-05-01 01:00:00 <NA>
5 B 2 2014-06-01 12:00:00 <NA>
6 C 2 2014-08-01 10:00:00 <NA>
My question is this: I'm not a very advanced R programmer, so I expect there is a more memory-efficient way to accomplish what I'm trying to do. Is there?
Try creating a table that expands the trip ranges by hour and then merging it with the events. Here is an example (using data.table, since data.table outperforms data.frame for larger datasets):
library('data.table')
tripsV <- unique(trips.df$trip)
tripExpand <- function(t){
dateV <- seq(trips.df$trip_start[trips.df$trip == t],
trips.df$trip_end[trips.df$trip == t],
by = 'hour')
data.table(trip = t, time = dateV)
}
trips.dt <- rbindlist(
lapply(tripsV, function(t) tripExpand(t))
)
merge(events.df,
trips.dt,
by = 'time')
Output:
time individual event trip
1 2014-01-01 08:00:00 A 1 x1A
2 2014-01-05 13:00:00 B 1 CA1B
3 2014-01-10 07:00:00 C 1 XX78
So you are basically translating the trip table into a trip-hour long-form panel dataset, which makes for easy merging with the event dataset. I haven't benchmarked it against your current method, but my hunch is that it will be more memory- and CPU-efficient.
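If the hourly expansion itself gets too large, data.table's foverlaps is built for exactly this kind of range join. A minimal sketch, assuming both data frames are first converted to data.tables (column names as in the question):
library(data.table)
ev <- as.data.table(events.df)
tr <- as.data.table(trips.df)

# foverlaps needs an interval in both tables, so give each event a
# zero-width interval [time, time]
ev[, `:=`(start = time, end = time)]
setkey(tr, individual, trip_start, trip_end)

# type = "within": keep events whose interval lies inside a trip's range;
# nomatch = NA keeps unmatched events, like the for-loop version
res <- foverlaps(ev, tr,
                 by.x = c("individual", "start", "end"),
                 type = "within", nomatch = NA)
This avoids materialising one row per trip-hour, since foverlaps works on the interval endpoints directly.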
Consider splitting your data with split, running fuzzy_inner_join on each subset, and then calling rbindlist to bind all the data frame elements together into a single output.
df_list <- split(events.df, events.df$individual)
fuzzy_list <- lapply(df_list, function(sub.df) {
fuzzy_inner_join(sub.df, trips.df,
by = c('individual'='individual', 'time'='trip_start', 'time'='trip_end'),
match_fun = list(`==`,`>=`,`<=`)
)
})
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(df_list); gc()
final_df <- rbindlist(fuzzy_list)
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(fuzzy_list); gc()
I have created an 'xts' object from a data frame - the data frame was loaded from a 'csv' file.
The 'xts' object looks like this:
           entitycode usage
2016-01-01          1 16521
2016-01-01          2  6589
2016-01-02          1 16540
2016-01-02          2  6687
2016-01-03          1 16269
2016-01-03          2  6642
There are a total of 1462 records in it - 731 for each of the entitycodes 1 and 2 - from 01/01/2016 through to 31/12/2017 with a frequency of 1 day.
Entitycodes 1 and 2 refer to different regions, say 'region1' and 'region2'.
Is there a way to create separate 'xts' objects (variables) for entitycodes 1 and 2 (or 'region1' and 'region2'), each with 731 rows, with names like 'region1_xts' and 'region2_xts'?
I would recommend splitting the xts object, resulting in a list of xts objects:
split(xts, xts$entitycode)
#$`1`
# entitycode usage
#2016-01-01 1 16521
#2016-01-02 1 16540
#2016-01-03 1 16269
#
#$`2`
# entitycode usage
#2016-01-01 2 6589
#2016-01-02 2 6687
#2016-01-03 2 6642
You can then use functions of the *apply family to easily operate on the different list elements (i.e. the xts objects).
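If you really do want standalone variables rather than a list, a small sketch (assuming entitycode 1 maps to 'region1' and 2 to 'region2') could name the list elements and push them into the global environment with list2env:
xts_list <- split(xts, xts$entitycode)
names(xts_list) <- c("region1_xts", "region2_xts")

# creates region1_xts and region2_xts as separate variables
list2env(xts_list, envir = .GlobalEnv)
Keeping the list is usually the more convenient design, though, since it avoids cluttering the workspace and plays well with lapply.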
Sample data
df <- read.csv(text =
" date,entitycode,usage
2016-01-01, 1,16521
2016-01-01, 2,6589
2016-01-02, 1,16540
2016-01-02, 2,6687
2016-01-03, 1,16269
2016-01-03, 2,6642", header = T)
mat <- as.matrix(df[, -1])
rownames(mat) <- df[, 1]
colnames(mat) <- colnames(df)[-1]
library(xts)
xts <- as.xts(mat)
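As a quick illustration of operating on the split result with the *apply family, a sketch based on the sample data above, computing the mean usage per entity:
region_list <- split(xts, xts$entitycode)
sapply(region_list, function(r) mean(r$usage))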
I have two data frames: one containing time periods marked with unique character IDs, and another containing events with another set of unique IDs associated with them.
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UDI_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF, in order to match the ID of each event to the ID of a period. Essentially, the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF to intervals and then using some form of apply with an instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- it does not have to be solved with lubridate; in fact a solution in base R would be even more welcome.
I actually managed to come up with code that produces exactly what I wanted, using lubridate. So if anyone knows how to do this in base R, or simply knows a better way than the one suggested below, sharing it will be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
intervalsP <- as.interval(periodStart, periodEnd)
Step 2: A function should be created for checking whether an instant is located within a list of intervals. The only reason I created a separate function is to be able to use it with apply:
PeriodAssign <- function(x, y){
# x - instants
# y - intervals
variable1 <- mapply(`%within%`, x, y)
if (length(y[variable1]) != 0) {
as.character(y[variable1])
} else {
NA
}
}
NOTE: I had to use the interval-to-character coercion because otherwise the intervals were coerced to their lengths in seconds by the apply function, making them useless for matching purposes - i.e. all four intervals in this example are the same length.
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched periods and events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The resulting DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03
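Since base R was explicitly welcomed: a minimal sketch without lubridate, comparing each event time directly against the raw POSIXct start/end columns (it relies on the stated guarantee that periods never overlap):
# for each event, find the period (if any) whose range contains it
eventDF$periodID <- sapply(eventDF$eventTime, function(t) {
  hit <- which(periodDF$periodStart <= t & t <= periodDF$periodEnd)
  if (length(hit) == 1) as.character(periodDF$periodID[hit]) else NA_character_
})
A full outer merge, merge(periodDF, eventDF, by = "periodID", all = TRUE), then gives the combined table including the unassociatable rows on both sides.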
Any comments/answers much appreciated.
Context: I have a large data table of daily prices of swap rates across a dozen countries. The columns are [ID, Date, X1Y, X2Y, X3Y ... X30Y], where X..Y are columns indicating the part of the yield curve (e.g. X1Y is 1-year swap, X3Y is 3-year swap). The two keys are ID (e.g. "AUD", "GBP") and Date (e.g. "2001-04-13", "2001-04-16").
Dummy Data:
set.seed(123)
dt <- cbind(ID=rep(c("AUD","GBP"),c(100,100)),X1Y=rnorm(200),X2Y=rnorm(200),X3Y=rnorm(200))
dt <- data.table(dt)
dt[,Date := seq(from=as.IDate("2013-01-01"), by="1 day", length.out=100)]
setkeyv(dt,c("ID","Date"))
Problem 1:
First, generate some dummy signals. What's the syntax if there are 100 columns, with a fairly complicated signal-generation formula coded in a separate function, say genSig(X1Y)? Here's what I mean, using just the 3 columns and some meaningless formula:
dt[,SIG1 :=c(0, diff(X1Y ,1)),by="ID"]
dt[,SIG2 :=c(0, diff(X2Y ,1)),by="ID"]
dt[,SIG3 :=c(0, diff(X3Y ,1)),by="ID"]
Problem 2:
Carry forward column(s) based on the "middle of the month". For example, using the SIG columns, I'd like to make everything after, say, the 15th of each month the same as the signal on the 15th, until the next month's 15th. The tricky thing is that the actual data contains just trading days, so some months have no 15th if it falls on a weekend/holiday. Another issue is finding an efficient syntax; I could achieve something similar using a loop (I know..) for the start of each month, just to show what I mean:
for (i in 2:length(dt$Date)){
if(as.POSIXlt(dt[i,]$Date)$mon == as.POSIXlt(dt[i-1,]$Date)$mon){
dt[i, SIG1 := dt[i-1,SIG1]]
dt[i, SIG2 := dt[i-1,SIG2]]
dt[i, SIG3 := dt[i-1,SIG3]]
}
}
I can't figure out how to deal with the "mid-month" issue, since it can fall on the 15th, 16th or 17th. As with Problem 1, I would appreciate a smart way to insert/update multiple/dozens of columns.
As far as problem 2 goes, you can use rolling joins:
# small sample to demonstrate
dt = data.table(date = as.Date(c('2013-01-01', '2013-01-15', '2013-01-17', '2013-02-14', '2013-02-17'), '%Y-%m-%d'), val = 1:5)
dt
# date val
#1: 2013-01-01 1
#2: 2013-01-15 2
#3: 2013-01-17 3
#4: 2013-02-14 4
#5: 2013-02-17 5
setkey(dt, date)
midmonth = seq(as.Date('2013-01-15', '%Y-%m-%d'),
as.Date('2013-12-15', '%Y-%m-%d'),
by = '1 month')
dt[, flag := 0]
dt[J(midmonth), flag := 1, roll = -Inf]
dt
# date val flag
#1: 2013-01-01 1 0
#2: 2013-01-15 2 1
#3: 2013-01-17 3 0
#4: 2013-02-14 4 0
#5: 2013-02-17 5 1
And now you can cumsum the flag to obtain the grouping you want, and e.g. carry the first value of each group forward:
dt[, val1 := val[1], by = cumsum(flag)]
dt
# date val flag val1
#1: 2013-01-01 1 0 1
#2: 2013-01-15 2 1 2
#3: 2013-01-17 3 0 2
#4: 2013-02-14 4 0 2
#5: 2013-02-17 5 1 5
# problem 1
nsig <- 3L
csig <- 1:nsig + 1L                  # positions of the input columns X1Y..X3Y
newcols <- paste('SIG', 1:nsig, sep='')
dt[, (newcols) := 0]                 # initialise the signal columns
# column j+nsig+1L is the SIG column for input column j; fill it with the lagged diff
for (j in csig) set(dt, j = j+nsig+1L, value = c(0, diff(dt[[j]], 1)))
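For many columns, a hedged sketch of the lapply-over-.SD form of problem 1, which also restores the per-ID grouping from the question (sigcols stands for the vector of input column names; a real genSig could replace the anonymous function):
sigcols <- c('X1Y', 'X2Y', 'X3Y')  # or all 100 column names
dt[, (newcols) := lapply(.SD, function(x) c(0, diff(x))),
   by = ID, .SDcols = sigcols]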
After looking at #eddi's answer, I see that set is not so useful for problem 2. Here's what I would do:
dt[,(newcols):=lapply(newcols,function(x) get(x)[1]),by=list(ID,month(Date-14))]
According to this answer, you can subtract days from a date in this way.
Aside. Cbind-ing vectors makes a matrix. In your example, you've got a character matrix. I think you were looking for...
# Creating better data...
set.seed(123)
dt <- data.table(ID=rep(c("AUD","GBP"),c(100,100)),
X1Y=rnorm(200),X2Y=rnorm(200),X3Y=rnorm(200),
Date=seq(from=as.IDate("2013-01-01"), by="1 day", length.out=100))
I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data, as below, for several years, with some values, and I want to get the sum of the related values for each 8-day period. What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame': 648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use the aggregate function, but for some reason it doesn't work and aggregates in the wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the package xts for such manipulations.
I read your data as a zoo object; note the flexibility of the format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the indices marking the end of each 8-day period:
ep <- endpoints(ts.dat,'days',k=8)
Finally, I apply my function to the time series at each index:
period.apply(x = ts.dat, INDEX = ep, FUN = sum)
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling-apply function which can also do non-rolling aggregations. First convert the temp data frame into a zoo object using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then it's just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
Those are NOT Date-classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date-class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked, since:
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8-day intervals, use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
A cleaner approach extending @G. Grothendieck's approach above. Note: it does not take into account whether the dates are continuous or discontinuous; the sum is calculated based on the fixed width.
code
library(zoo)  # for rollapply

interval = 8         # your desired date interval: 2 days, 3 days or whatever
enddate = interval-1 # offset to the end of each interval

z <- aggregate(. ~ V1, data = df, sum)  # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
nrows = nrow(z)      # computed after z exists

data.frame(Start.date = z[seq(1, nrows, interval), 1],
           End.date   = z[seq(1, nrows, interval) + enddate, 1],
           Total.sum  = rollapply(z$V2, interval, sum, by = interval, partial = TRUE))
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))