I am a beginner in R, and I would like to do a survival analysis on a dataset about light bulbs. I want to calculate the lifetime of a light bulb, so I need the time period between, for example, date_broken in row 2 and date_fixed in row 1.
I know I can use difftime(time, time2, units = "days") to calculate the time between date_fixed and date_broken in the same row, but that would give me the time the light bulb was broken, and that is not what I am interested in.
I provided a small sample of my data below. For each light bulb at a particular location I have information about the date it broke and the date it was fixed.
(Besides the columns given in the example below, I have other features that should have predictive value.)
# date_broken date_fixed lightbulb location
# 1 26-2-2015 17-3-2015 1 A
# 2 19-3-2015 26-3-2015 1 A
# 3 26-3-2015 26-3-2015 1 A
# 4 17-4-2015 29-4-2015 2 B
# 5 19-6-2015 25-6-2015 2 B
# 6 9-7-2015 30-7-2015 2 B
ds <- data.frame(date_broken = c("26-2-2015", "19-3-2015", "26-3-2015",
                                 "17-4-2015", "19-6-2015", "9-7-2015"),
                 date_fixed  = c("17-3-2015", "26-3-2015", "26-3-2015",
                                 "29-4-2015", "25-6-2015", "30-7-2015"),
                 lightbulb   = c("1", "1", "1", "2", "2", "2"),
                 location    = c("A", "A", "A", "B", "B", "B"))
First you'll need to fix your dates, as @Gaurav suggested. Then you'll need to compute the difference within each lightbulb, or it will be meaningless.
I present here an alternative using the lubridate and data.table packages:
library(lubridate)
library(data.table)
ds$date_broken <- dmy(ds$date_broken)
ds$date_fixed  <- dmy(ds$date_fixed)
setDT(ds)
ds[, dt := difftime(date_fixed, shift(date_broken, 1L, type = "lag"), units = "days"),
   by = lightbulb]
ds
Which produces:
## date_broken date_fixed lightbulb location dt
## 1: 2015-02-26 2015-03-17 1 A NA days
## 2: 2015-03-19 2015-03-26 1 A 28 days
## 3: 2015-03-26 2015-03-26 1 A 7 days
## 4: 2015-04-17 2015-04-29 2 B NA days
## 5: 2015-06-19 2015-06-25 2 B 69 days
## 6: 2015-07-09 2015-07-30 2 B 41 days
For future questions, it helps a lot if you include some expected results along with your question.
This should help
library(dplyr)

ds2 <- ds %>%
  group_by(lightbulb) %>%
  mutate(tp = as.Date(date_broken, "%d-%m-%Y") -
              as.Date(lag(date_fixed, 1), "%d-%m-%Y"))
I really love those super-impressive pipe operators in R. They're so elegant, and great if someone's got a ready-to-go solution.
I mostly do loops, probably because I like to see what's going on and I can debug as I go. (I was also brought up on BASIC some decades ago, but don't tell anyone.)
Anyway, this was my approach for something very similar I was doing, with hopefully a bit of added value from a sequence counter. That counter might be useful as a regression variable (covariate or stratification) or as something to subset by, if for example you want to look at later failures separately from earlier ones. Survival curves by sequence counter were quite informative in my work.
First convert the dates...
## convert dates. once done it's done
ds$date_broken <- as.Date(ds$date_broken, "%d-%m-%Y")
ds$date_fixed <- as.Date(ds$date_fixed, "%d-%m-%Y")
Add a sequence counter column (called seq) to keep track of the number of failures:
ds$seq <- 1
Populate that counter column
for (rdx in 2:nrow(ds)) {
  ## if same item, increment the count; if a new item, the count stays at seq = 1
  if (ds$lightbulb[rdx] == ds$lightbulb[rdx - 1]) {
    ds$seq[rdx] <- ds$seq[rdx - 1] + 1
  }
}
Now add a difference column
ds$diff <- NA
Populate that difference column
for (rdx in 2:nrow(ds)) {
  ## if same item, difference is current failure date minus previous in-service (fixed) date
  if (ds$seq[rdx] != 1) {
    ds$diff[rdx] <- ds$date_broken[rdx] - ds$date_fixed[rdx - 1]
  }
}
Well, that worked for me, both to conceptualise and to implement. Please note that some folk do get a bit hung up about loops (http://paleocave.sciencesortof.com/2013/03/writing-a-for-loop-in-r/), but around my neck of the woods life is for living, not coding, and even I sometimes stir paint with a screwdriver (an old one, though).
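As a rough illustration of those survival curves by sequence counter, here is a minimal sketch using the survival package. It assumes every in-service interval ends in an observed failure (no censoring), which holds for this toy data but usually not for real data, where you would supply a proper status indicator.

library(survival)

## keep only rows with a known in-service time (each bulb's first failure has none)
ds_obs <- ds[!is.na(ds$diff), ]

## Kaplan-Meier curves of days in service, stratified by the failure sequence counter
fit <- survfit(Surv(as.numeric(diff)) ~ seq, data = ds_obs)
plot(fit, col = seq_along(fit$strata), xlab = "days in service", ylab = "survival")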
I have a very large set of data driven by an id and a date. The dataset has several hundred million rows and about 10 million ids. I am running in a non-Windows environment with ample RAM and multiple processors available, and I am doing this in parallel. At the moment I'm working with multidplyr, though I am considering all options.
For illustration:
> df[1:11,]
id date gap episode
1 100000019 2015-01-24 0 1
2 100000019 2015-02-20 27 1
3 100000019 2015-03-31 39 2
4 100000019 2015-04-29 29 2
5 100000019 2015-05-27 28 2
6 100000019 2015-06-24 28 2
7 100000019 2015-07-24 30 2
8 100000019 2015-08-23 30 2
9 100000019 2015-09-21 29 2
10 100000019 2015-10-22 31 3
11 100000019 2015-12-30 69 4
The data is sorted before the function call, and the order is important. For each id, after the first date, I need to determine the number of days between each subsequent date. I call this a gap. So the first date for an id gets a gap of zero, the second date gets the value of the second date minus the date in the prior row, and so on.
I am splitting the data by id, then sending the data for each id to the following function.
assign_gap <- function(x) {
  # x$gap <- NA
  for(i in 1:nrow(x)) {
    x[i, ]$gap <- ifelse(i == 1, 0, x[i, ]$date - x[i-1, ]$date)
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_gap', assign_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_gap(.)) %>% collect())
I then apply another function that groups the sequence of gaps across dates into "episodes" based on allowable_gap (I am using a value of 30). So, each id will potentially have multiple episodes assigned based on the date sequence and the gap.
assign_episode <- function(x, allowable_gap){
  ep <- 1
  for(i in 1:nrow(x)){
    ifelse(x[i, ]$gap <= allowable_gap, ep <- ep, ep <- ep + 1)
    x[i, ]$episode <- ep
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_episode', assign_episode)
cluster_assign_value(cluster, 'allowable_gap', allowable_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_episode(., allowable_gap)) %>% collect())
Given the amount of data I have, I'd really like to find a way to avoid these loops in the functions, which I expect will improve efficiency considerably. If anyone can think of an alternative that accomplishes the same thing, I would be grateful.
I would recommend using the data.table library. This library is extremely fast, particularly if one is working with large data sets like yours. Here is a partial solution, where I solve the first step of your question:
1. Calculate the gap between dates, making sure the first row of each id is 0
library(data.table)
setDT(df)
df[, gap := c(0L, diff(date)) , by = id ]
Even though this is not working in parallel, I would expect this code to be faster than the loop you're currently using.
2. Assign a group episode for consecutive observations when the gap is under 30 by id
I haven't tested a solution for the second part of your question at scale yet, but I would encourage others to complement this answer if they find a better approach.
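That said, the same vectorized idea should carry over: a new episode begins whenever gap exceeds allowable_gap, so a grouped cumulative sum of that condition reproduces the episode numbering in the question's example. A sketch, untested on data of this size and assuming the allowable_gap of 30 used in the question:

allowable_gap <- 30

## an episode break occurs wherever the gap exceeds the allowable value;
## counting the breaks cumulatively within each id gives the episode number
df[, episode := cumsum(gap > allowable_gap) + 1L, by = id]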
Suppose I have a series of observations representing date intervals, e.g.
library(dplyr)
library(magrittr)
df <- data_frame(start = as.Date(c('2000-01-01', '2000-01-03', '2000-01-08',
                                   '2000-01-20', '2000-01-22')),
                 end   = as.Date(c('2000-01-02', '2000-01-05', '2000-01-10',
                                   '2000-01-21', '2000-02-10')))
I would like to group these observations such that the start time of observation n occurs within some specified interval following the end date of observation n-1. For instance, if we set that interval to be 5 days, we would see something like:
# start end group
# (date) (date) (dbl)
# 1 2000-01-01 2000-01-02 1
# 2 2000-01-03 2000-01-05 1
# 3 2000-01-08 2000-01-10 1
# 4 2000-01-20 2000-01-21 2
# 5 2000-01-22 2000-02-10 2
(For the sake of simplicity, I'm assuming no overlap in dates, although this isn't necessarily the case in the data). I thought about using igraph to create a weighted edgelist, but that seemed overly complicated. Efficiency is, I believe, important: I'll be running this on roughly 4 million groups of data of about 5-10 rows each.
While my solution does work, to me it seems error-prone, slow, and clunky. I'm thinking using a package or some vectorization would really improve matters.
group_dates <- function(df, interval){
  # assign first date to first group
  df %<>% arrange(start, end)
  df[1, 'group'] <- 1
  # for each start date, determine if it is within `interval` days of the
  # closest end date
  lapply(df$start[-1], function(cur_start){
    earlier_data <- df[df$end <= cur_start, ]
    diffs <- cur_start - earlier_data$end
    min_interval <- diffs[which.min(diffs)]
    closest_group <- earlier_data$group[which.min(diffs)]
    if(min_interval <= interval){
      df[df$start == cur_start, 'group'] <<- closest_group
    } else {
      df[df$start == cur_start, 'group'] <<- closest_group + 1
    }
  })
  return(df)
}
You can do that relatively easily with dplyr.
The idea is the following:
Lag the end date (shifting it down by one)
Calculate the difference between the start date and the lagged end date
Add a break-point variable that is TRUE when the difference is more than 5 days and FALSE otherwise
Calculate the cumulative sum of the break-points. This adds 1 every time a new break-point is found, i.e. every time a new interval should start
Something like this should work for you:
df %>%
  mutate(lagged_end = lag(end),
         diff = start - lagged_end,
         new_interval = diff > 5,
         new_interval = ifelse(is.na(new_interval), FALSE, new_interval),
         interval_number = cumsum(new_interval))
This should also be quite quick since it's all in dplyr.
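Note that interval_number starts at 0 for the first group here. If you want the numbering to start at 1 and match the expected output exactly, a small variation along the same lines (just a sketch):

library(dplyr)

df %>%
  mutate(new_interval = as.numeric(start - lag(end)) > 5,
         new_interval = coalesce(new_interval, FALSE),   # the first row has no previous end date
         group        = cumsum(new_interval) + 1) %>%
  select(start, end, group)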
This isn't as elegant as Lorenzo Rossi's solution, but offers a slightly different approach using cut.Date and 2 lines of code:
breakpoints <- c(FALSE, sapply(2:nrow(df), function(x) df[x,"start"] - df[x-1,"end"]) > 5)
clusterLabels <- as.numeric(cut.Date(df$start, c(min(df$start), df[breakpoints, "start"], max(df$start)+1)))
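To attach those labels to the data and compare against the expected output above, one more line should do it:

df$group <- clusterLabels
df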
I have a dataset with dates occurring randomly. For example:
10/21/15, 11/21/15, 11/22/15, 11/28/15, 11/30/15, 12/12/15, etc.
I am looking to create a rolling average by time period, NOT at the observation level. For instance, if I wanted a moving average of the last 7 days, I would not want to look at the last 7 rows, but rather at the last 7 days.
For a tiny example:
dates = c('2015-08-07', '2015-08-08','2015-08-09','2015-09-09','2015-10-10')
value = c(5,10,5,3,2)
df=data.frame(dates, value)
df$desired = c(NA,5,7.5, NA,NA)
I am obviously looking to do this for a much larger dataset, but I hope you get the idea. If I were to use 7 days, for example, this is the result I would expect.
Notice that I don't include the current observation's value in the rolling average, only the previous ones. I want a rolling average by time period, not by observation row number.
I tried looking at rollmean and dplyr but I couldn't figure it out. I don't really care how it happens though.
Thanks!
try this:
rollavgbyperiod <- function(i, window){
  ## average the values whose dates fall within the `window` days before date i,
  ## excluding the current observation itself
  startdate <- dates[i] - window
  enddate   <- dates[i] - 1
  interval  <- seq(startdate, enddate, 1)
  tmp       <- value[dates %in% interval]
  return(mean(tmp))
}
dates <- as.Date(dates)
window <- 7
res <- sapply(1:length(dates),function(m) rollavgbyperiod(m,window))
res[is.nan(res)] <- NA
> data.frame(dates,value,res)
dates value res
1 2015-08-07 5 NA
2 2015-08-08 10 5.0
3 2015-08-09 5 7.5
4 2015-09-09 3 NA
5 2015-10-10 2 NA
I suggest using the runner package in this case. What is needed here is mean_run with a k = 7 days window, lagged by 1 period. A simple one-liner:
library(runner)
dates = c('2015-08-07', '2015-08-08','2015-08-09','2015-09-09','2015-10-10')
value = c(5, 10, 5, 3, 2)
mean_run(x = value, k = 7, lag = 1, idx = as.Date(dates))
#[1] NA 5.0 7.5 NA NA
Check the package and function documentation for more details.
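For example, to keep the result alongside the toy data (same call as above, just stored as a column; the column name rolling7 is my own):

df <- data.frame(dates = as.Date(dates), value = value)
df$rolling7 <- mean_run(x = df$value, k = 7, lag = 1, idx = df$dates)
df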
I have monthly weight observations and daily returns, and I am trying to compute a geometric return for every day in a month. It might be easier to see the pattern.
How do I reproduce the "desired output" column? Either a solution using base R functions or any package suggestion is appreciated!
Edit 1:
Thank you.
Here is some sample data and the solution that I have been working on:
set.seed(33)
z <- c(.35,NA,NA,NA,.2,NA,NA)
z1 <- c(.35,.35,.35,.35,.2,.2,.2)
z2 <- rnorm(7)
zCbind <- data.frame(cbind(z,z1,z2))
colnames(zCbind) <- c("months","na.locf(months)","values")
solution1 <- ifelse(zCbind[,1] == zCbind[,2],
                    zCbind[,1],                                 # if TRUE
                    zCbind[,2]*apply(zCbind[,3],2,cumprod))     # if FALSE
I know my problem is in the false condition. Solutions that I have tried are:
replacing cumprod with the prod function
changing the format of zCbind[,3] by binding it or converting it to a matrix/data frame
this looked promising, but I can't find any more literature on the "cumprod.column" wrappers to the cumprod function: http://braverock.com/brian/R/PerformanceAnalytics/html/cum.utils.html
How about this with plyr::ddply()?
I recreated your data to make it more like your original format.
library(plyr)

sheet <- data.frame(date = as.Date(1:100, origin = "2012-01-01"),
                    weight = rep(NA, 100),
                    increment = rnorm(100, 0, 0.5) / 100)
#get the latest date in each month to replace the NAs
last_days<-ddply(sheet,.(month=format(date,"%Y-%b")),summarise,last_day=max(date))
sheet[sheet$date %in% last_days$last_day,]$weight<-runif(nrow(last_days))/2
#now we have a table which matches your data
#set the NA's to 0
sheet$weight[is.na(sheet$weight)]<-0
# OK so here you add your seed value for the first month (0.4 in this example)
# and shift forward into the last month
sheet$shift <- c(0.4, sheet$weight[1:(nrow(sheet) - 1)])
sheet.out <-
  ddply(sheet,
        .(month = format(date, "%Y-%b")),
        summarise,
        date = date,
        inc = increment,
        # cumulative product of the seed value and the daily returns
        output = cumprod(ifelse(shift == 0, 1 + increment, max(shift) * (1 + increment))))
# and lastly update the last days to be the original weight
sheet.out$output<-ifelse(sheet$weight!=0,sheet$weight,sheet.out$output)
head(sheet.out)
# month date inc output
#1 2012-Apr 2012-04-01 0.0018504578 0.3234371
#2 2012-Apr 2012-04-02 0.0017762242 0.3240116
#3 2012-Apr 2012-04-03 0.0091980829 0.3269919
#4 2012-Apr 2012-04-04 -0.0023334368 0.3262289
#5 2012-Apr 2012-04-05 0.0042003969 0.3275992
#6 2012-Apr 2012-04-06 0.0005409113 0.3277764
Your comments/answers are much appreciated.
Context: I have a large data table of daily prices of swap rates across a dozen countries. The columns are [ID, Date, X1Y, X2Y, X3Y ... X30Y], where X..Y are columns indicating the part of the yield curve (e.g. X1Y is 1-year swap, X3Y is 3-year swap). The two keys are ID (e.g. "AUD", "GBP") and Date (e.g. "2001-04-13", "2001-04-16").
Dummy Data:
set.seed(123)
dt <- cbind(ID=rep(c("AUD","GBP"),c(100,100)),X1Y=rnorm(200),X2Y=rnorm(200),X3Y=rnorm(200))
dt <- data.table(dt)
dt[,Date := seq(from=as.IDate("2013-01-01"), by="1 day", length.out=100)]
setkeyv(dt,c("ID","Date"))
Problem 1:
First, generate some dummy signals. What's the syntax if there are 100 columns, with a fairly complicated signal-generation formula coded in a separate function, say genSig(X1Y)? Here's what I mean, using just the 3 columns and a meaningless formula:
dt[,SIG1 :=c(0, diff(X1Y ,1)),by="ID"]
dt[,SIG2 :=c(0, diff(X2Y ,1)),by="ID"]
dt[,SIG3 :=c(0, diff(X3Y ,1)),by="ID"]
Problem 2:
Carry forward column(s) based on the "middle of the month". For example, using the SIG columns, I'd like to make everything after, say, the 15th of each month the same as the signal on the 15th, until the next month's 15th. The tricky thing is that the actual data contains only trading days, so some months have no 15th if it falls on a weekend/holiday. Another issue is finding an efficient syntax; just to show what I mean, I could achieve something similar for the start of each month using a loop (I know..):
for (i in 2:length(dt$Date)){
  if(as.POSIXlt(dt[i,]$Date)$mon == as.POSIXlt(dt[i-1,]$Date)$mon){
    dt[i, SIG1 := dt[i-1, SIG1]]
    dt[i, SIG2 := dt[i-1, SIG2]]
    dt[i, SIG3 := dt[i-1, SIG3]]
  }
}
I can't figure out how to deal with the "mid-month" issue, since the middle of the month can fall on the 15th, 16th, or 17th. As with Problem 1, I'd appreciate a smart way to insert/update dozens of columns at once.
As far as problem 2 goes, you can use rolling joins:
# small sample to demonstrate
dt = data.table(date = as.Date(c('2013-01-01', '2013-01-15', '2013-01-17',
                                 '2013-02-14', '2013-02-17'), '%Y-%m-%d'), val = 1:5)
dt
# date val
#1: 2013-01-01 1
#2: 2013-01-15 2
#3: 2013-01-17 3
#4: 2013-02-14 4
#5: 2013-02-17 5
setkey(dt, date)
midmonth = seq(as.Date('2013-01-15', '%Y-%m-%d'),
               as.Date('2013-12-15', '%Y-%m-%d'),
               by = '1 month')
dt[, flag := 0]
dt[J(midmonth), flag := 1, roll = -Inf]
dt
# date val flag
#1: 2013-01-01 1 0
#2: 2013-01-15 2 1
#3: 2013-01-17 3 0
#4: 2013-02-14 4 0
#5: 2013-02-17 5 1
And now you can cumsum the flag to obtain the grouping you want and, e.g., do:
dt[, val1 := val[1], by = cumsum(flag)]
dt
# date val flag val1
#1: 2013-01-01 1 0 1
#2: 2013-01-15 2 1 2
#3: 2013-01-17 3 0 2
#4: 2013-02-14 4 0 2
#5: 2013-02-17 5 1 5
# problem 1
nsig    <- 3L                         # number of signal columns
csig    <- 1:nsig + 1L                # column positions of X1Y, X2Y, X3Y
newcols <- paste('SIG', 1:nsig, sep = '')
dt[, (newcols) := 0]                  # create SIG1..SIG3, filled with 0
# overwrite each SIG column (position j + nsig + 1) with c(0, diff(...)) of its X column
for (j in csig) set(dt, j = j + nsig + 1L, value = c(0, diff(dt[[j]], 1)))
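If the real signal logic lives in a function like the genSig() mentioned in the question, the same update can also be written with .SD, which scales to 100 columns without naming each one. This is only a sketch: genSig below is a stand-in I made up using the same meaningless diff formula, and it assumes the X columns are numeric (see the aside on the cbind-ed dummy data at the end of this answer).

## stand-in for the question's genSig(): takes one column, returns one of equal length
genSig <- function(x) c(0, diff(x, 1))

sigcols <- grep("^X", names(dt), value = TRUE)    # X1Y, X2Y, X3Y, and so on
dt[, paste0("SIG", seq_along(sigcols)) := lapply(.SD, genSig), .SDcols = sigcols, by = ID]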
After looking at @eddi's answer, I see that set is not so useful for problem 2. Here's what I would do:
## group by ID and month(Date - 14): each group runs from the 15th of a month to the 14th of the next
dt[, (newcols) := lapply(newcols, function(x) get(x)[1]), by = list(ID, month(Date - 14))]
According to this answer, you can subtract days from a date in this way.
Aside. Cbind-ing vectors makes a matrix. In your example, you've got a character matrix. I think you were looking for...
# Creating better data...
set.seed(123)
dt <- data.table(ID = rep(c("AUD", "GBP"), c(100, 100)),
                 X1Y = rnorm(200), X2Y = rnorm(200), X3Y = rnorm(200),
                 Date = seq(from = as.IDate("2013-01-01"), by = "1 day", length.out = 100))