I have a very large set of data driven off of an id and a date. The dataset has several hundred million rows and about 10 million ids. I am running in a non-Windows environment with ample RAM and multiple processors available. I am doing this in parallel. At the moment, I'm working with multidplyr, though I am considering all options.
For illustration:
> df[1:11,]
id date gap episode
1 100000019 2015-01-24 0 1
2 100000019 2015-02-20 27 1
3 100000019 2015-03-31 39 2
4 100000019 2015-04-29 29 2
5 100000019 2015-05-27 28 2
6 100000019 2015-06-24 28 2
7 100000019 2015-07-24 30 2
8 100000019 2015-08-23 30 2
9 100000019 2015-09-21 29 2
10 100000019 2015-10-22 31 3
11 100000019 2015-12-30 69 4
The data is sorted before the function call, and the order is important. For each id, after the first date, I need to determine the number of days between each date and the date before it. I call this a gap. So the first date for an id gets a gap of zero, the second date gets the value of the second date minus the date in the prior row, and so on.
I am splitting the data by id, then sending the data for each id to the following function.
assign_gap <- function(x) {
  # x$gap <- NA
  for (i in 1:nrow(x)) {
    x[i, ]$gap <- ifelse(i == 1, 0, x[i, ]$date - x[i - 1, ]$date)
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_gap', assign_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_gap(.)) %>% collect())
I then apply another function that groups the sequence of gaps across dates into "episodes" based on allowable_gap (I am using a value of 30). So, each id will potentially have multiple episodes assigned based on the date sequence and the gap.
assign_episode <- function(x, allowable_gap) {
  ep <- 1
  for (i in 1:nrow(x)) {
    ifelse(x[i, ]$gap <= allowable_gap, ep <- ep, ep <- ep + 1)
    x[i, ]$episode <- ep
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_episode', assign_episode)
cluster_assign_value(cluster, 'allowable_gap', allowable_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_episode(., allowable_gap)) %>% collect())
Given the amount of data I have, I'd really like to find a way to avoid these loops in the functions, which I expect will improve efficiency considerably. If anyone can think of an alternative that accomplishes the same thing, I would be grateful.
I would recommend using the data.table library. This library is extremely fast, particularly if one is working with large data sets like yours. Here is a partial solution, where I solve the first step of your question:
1. Calculate the gap between dates, making sure the first row of each id is 0
library(data.table)
setDT(df)
df[, gap := c(0L, diff(date)), by = id]
Even though this is not working in parallel, I would expect this code to be faster than the loop you're currently using.
2. Assign a group episode for consecutive observations when the gap is under 30 by id
I haven't fully worked out a solution for the second part of your question, but I would encourage others to complement this answer if they find one.
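That said, here is one candidate for step 2 that others may want to verify: a new episode starts exactly when a gap exceeds allowable_gap, so a grouped cumulative sum should reproduce the loop's logic without row-by-row iteration:
allowable_gap <- 30
df[, episode := cumsum(gap > allowable_gap) + 1L, by = id]
On the eleven sample rows shown in the question, this reproduces the episode column exactly.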
I am working with these data in R. These are the first six rows (not counting the first column, which the write.csv function always adds):
> head(my_data)
   client_id contract_id contract_start contract_end inter_complex
1          1           1     15/07/2019   15/07/2020  18092+18458i
3          3           3      1/01/2015    1/01/2015  16436+16436i
5          5           5     12/06/2020   12/06/2020  18425+18425i
13        13          13      1/01/2015    1/01/2015  16436+16436i
18        18          18      1/01/2015    1/01/2015  16436+16436i
19        19          19      1/01/2015    1/01/2015  16436+16436i
Each row represents a different contract. The variable inter_complex is a complex number whose real part is the numeric representation of the date when a contract started, and whose imaginary part analogously represents the date when a contract ended. In case you're wondering, you can obtain that column by executing this:
library(tidyverse)
library(lubridate)
chars_2_cplex = function(start, end) {
  cbind(start, end) %>%
    apply(2, compose(as.numeric, dmy)) %*% rbind(1, 1i)
}
my_data %>% transmute(inter_complex = chars_2_cplex(contract_start, contract_end))
What I want is, for each client id and each contract, to identify how many contracts associated with that same client id intersect with that contract. In other words, I want to create a new column called simultaneous that shows, for each row (i.e. for each contract), how many active contracts the corresponding client has during the very same period that the current contract is active. If no intersection with any other contract is found for a given contract, then the value of simultaneous should be 1, since while that contract is active it is also the only active contract that the respective client has.
I figured it would help to obtain the combinations of inter_complex, then turn those combinations of complex numbers into combinations of intervals, and then use lubridate's intersect function to discern whether or not the intervals in each combination intersect. For that purpose, I have written the following code:
## This function turns complex numbers into intervals.
cplex_2_inter = function(x) {
  start = x %>% Re() %>% as.integer()
  end = x %>% Im() %>% as.integer()
  interval(as_date(start), as_date(end))
}
## This other function returns a list whose j-th element is a data frame showing the
## intersections between combinations of j + 1 intervals.
get_intersections = function(x) {
  max_m = length(x)
  output = vector(mode = "list", length = max_m - 1)
  for (i in 2:max_m) {
    output[[i - 1]] = combn(x, m = i) %>% t() %>% as.data.frame() %>%
      mutate_all(cplex_2_inter) %>% rowid_to_column("id") %>%
      pivot_longer(-id) %>% group_by(id) %>%
      mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
      mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", i, 1L))
  }
  return(output)
}
In order to get a better grasp on what the function get_intersections does, I propose that you run the following:
example = my_data %>% filter(client_id == 1) %>% pull(inter_complex) %>% get_intersections()
The data frame example[[1]] shows whether there are intersections (or, for a better word, overlaps) between pairs of intervals. The data frame example[[2]] shows whether there are overlaps between groups of three intervals, and so on.
You may notice that, according to example[[1]], the interval 2019-07-15 UTC--2020-07-15 UTC overlaps with some other interval (hence its simultaneous value of 2), while according to example[[2]] that very same interval is associated with a simultaneous value of 3. Naturally, the idea is to assign to each interval its highest simultaneous value.
Since I do not care about global overlaps but rather about overlaps within each client id, I figured I would need to work on a grouped data frame. The furthest I got on this project was writing this:
my_data %>% group_by(client_id) %>% group_map(~ get_intersections(.x$inter_complex))
Now onto my questions. 1) I have executed the line above, but the process is not very efficient: it has been running for a bit more than a whole day and still hasn't finished. I recently came across the concept of interval trees, but I'm not a computer scientist and I would need help tackling this problem in a smarter way. 2) If we stick to my not-so-smart approach, I would still need a function that accesses each element of the list returned by get_intersections in order to identify and retrieve the highest simultaneous value associated with each interval. I would need help with that as well.
Edit
Regarding Wimpel's answer, I have examined their data table and I found this.
> DT %>% filter(client_id == 502 & contract_id == 3093) %>%
+   select(contract_start, contract_end, contract_intersect)
# Output
contract_start contract_end contract_intersect
1: 2018-01-11 2019-01-11 7
That is, the displayed contract allegedly overlaps with seven other contracts that the same client has.
On the other hand, let's see if this holds true when using my combinations-based approach.
combs_10_502 = my_data %>% filter(client_id == 502) %>% pull(inter_complex) %>%
combn(10) %>% t() %>% as.data.frame() %>% mutate_all(cplex_2_inter) %>%
rowid_to_column("id") %>% pivot_longer(-id) %>% group_by(id) %>%
mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
ungroup() %>%
mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", 10L, 1L))
> combs_10_502 %>% filter(simultaneous == 10) %>% slice(11:20)
# A tibble: 10 x 4
id name value simultaneous
<int> <chr> <Interval> <int>
1 24311 V1 2018-01-11 UTC--2019-01-11 UTC 10
2 24311 V2 2018-03-01 UTC--2019-03-01 UTC 10
3 24311 V3 2018-07-11 UTC--2019-07-11 UTC 10
4 24311 V4 2018-04-20 UTC--2019-04-20 UTC 10
5 24311 V5 2018-05-21 UTC--2019-05-21 UTC 10
6 24311 V6 2018-08-10 UTC--2019-08-10 UTC 10
7 24311 V7 2018-08-09 UTC--2019-08-09 UTC 10
8 24311 V8 2018-09-27 UTC--2019-09-27 UTC 10
9 24311 V9 2020-01-03 UTC--2021-01-03 UTC 10
10 24311 V10 2019-12-19 UTC--2020-12-19 UTC 10
The same contract is displayed on the first row of the tibble above. As can be seen, that contract actually overlaps with nine other contracts of the given client (the nine displayed on the remaining rows).
I don't know how Wimpel's solution got this wrong, but I checked that it does get the number of intersections right for several other contracts. I now know that a data.table-based solution is what I am looking for, since it runs very fast, but there seems to be an issue with the proposed solution.
I believe you are looking for something like this?
library(data.table)
DT <- fread("https://raw.githubusercontent.com/pazos-feren/Data/main/contracts.csv")
#set dates as real dates
DT[, contract_start := as.Date(contract_start, format = "%d/%m/%Y")]
DT[, contract_end := as.Date(contract_end, format = "%d/%m/%Y")]
setkey(DT, V1)
DT[DT, c("contract_intersect", "contract_intersect_ids") := {
val = DT[ !V1 == i.V1 & client_id == i.client_id &
contract_start <= i.contract_end & contract_end >= i.contract_start, ]
list( nrow(val), paste0(val$contract_id, collapse = ";") )
}, by = .EACHI]
# V1 client_id contract_id contract_start contract_end inter_complex contract_intersect contract_intersect_ids
# 1: 1 1 1 2019-07-15 2020-07-15 18092+18458i 2 4162;4168
# 2: 2 3 3 2015-01-01 2015-01-01 16436+16436i 0
# 3: 3 5 5 2020-06-12 2020-06-12 18425+18425i 0
# 4: 4 13 13 2015-01-01 2015-01-01 16436+16436i 0
# 5: 5 18 18 2015-01-01 2015-01-01 16436+16436i 0
# 6: 6 19 19 2015-01-01 2015-01-01 16436+16436i 0
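A note on how this works: DT[DT, ..., by = .EACHI] is a keyed self-join that evaluates the j expression once per row of the inner DT; inside it, the i.-prefixed columns refer to the current row, so val collects that client's other contracts whose date ranges overlap the current contract, and the two new columns store their count and their ids.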
I have a table of about 50 000 rows, with four columns.
ID Arrival Departure Gender
1 10/04/2015 23:14 11/04/2015 00:21 F
1 11/04/2015 07:59 11/04/2015 08:08 F
3 10/04/2017 21:53 30/03/2017 23:37 M
3 31/03/2017 07:09 31/03/2017 07:57 M
3 01/04/2017 01:32 01/04/2017 01:35 M
3 01/04/2017 13:09 01/04/2017 14:23 M
6 10/04/2015 21:31 10/04/2015 23:17 F
6 10/04/2015 23:48 11/04/2015 00:05 F
6 01/04/2016 21:45 01/04/2016 22:48 F
6 02/04/2016 04:54 02/04/2016 07:38 F
6 04/04/2016 18:41 04/04/2016 22:48 F
10 10/04/2015 22:39 11/04/2015 00:42 M
10 13/04/2015 02:57 13/04/2015 03:07 M
10 31/03/2016 22:29 01/04/2016 08:39 M
10 01/04/2016 18:49 01/04/2016 19:44 M
10 01/04/2016 22:28 02/04/2016 00:31 M
10 05/04/2017 09:27 05/04/2017 09:28 M
10 06/04/2017 15:12 06/04/2017 15:43 M
This is a very small representation of the table. What I want to find out is, for each entry, how many others were present at the same time, separated by gender. So, say for example that at the same time as the first presence of person ID 1, person ID 6 was present and person ID 10 was present twice within the same interval. That would mean that 2 other overlaps occurred at the same time. It also means that person ID 1 has overlapped with 1 male and 1 female.
So its result should look like:
ID Arrival Departure Males encountered Females encountered
1 10/04/2015 23:14 11/04/2015 00:21 1 1
How would I be able to calculate this? I have tried to work with foverlaps, and I have managed to solve this in Excel, but I want to do it in R.
Here is a data.table solution using foverlaps.
First, notice that there's an error in your data:
ID Arrival Departure Gender
3 10/04/2017 21:53 30/03/2017 23:37 M
The guest arrived about eleven days after he had actually departed. I needed to get rid of that row in order for foverlaps to run.
library(data.table)
dt <- data.table(df)
dt <- dt[Departure > Arrival, ] # filter wrong cases
setkey(dt, "Arrival", "Departure") # prepare for foverlaps
dt2 <- copy(dt) # use a different dt, inherits the key
Run foverlaps, then filter (leave only) the cases where the arrival of the second person is not after the arrival of the first, dropping same-user matches. Then, grouped by ID and Arrival, add a variable counting the simultaneous male guests and another counting the simultaneous female guests:
simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
                                   .(malesEncountered = sum(i.Gender == "M"),
                                     femalesEncountered = sum(i.Gender == "F")),
                                   by = .(ID, Arrival)]
Join the findings of the previous command with our original table on ID and arrival
result <- simultaneous[dt, on = .(ID, Arrival)]
EDIT: Convert the NAs in malesEncountered and femalesEncountered to zero:
result[is.na(malesEncountered), malesEncountered := 0][
  is.na(femalesEncountered), femalesEncountered := 0]
Finally, set the column order to something nicer:
setcolorder(result, c(1, 2, 5, 6, 3, 4))[]
Here's one possibility. This uses lubridate's interval and the int_overlaps function that finds date overlaps. That has a drawback though: interval objects don't work with dplyr. So this version just does all the work manually in a for loop.
It starts by making a 1000 row random dataset that matches yours: each person arrives in a two year period and departs one or two days later.
It's taking about 24 seconds for 1000 to run so you can expect it to take a while for 50K! The for loop outputs the row number so you can see where it is though.
Any questions about the code, lemme know.
There must be a faster vectorised way but interval didn't seem to play nice with apply either. Someone else might have something quicker...
The final output is the final data frame built at the end of the code below, with a Male and a Female count column for each row.
library(tidyverse)
library(lubridate)
#Sample data:
#(Date sampling code: https://stackoverflow.com/questions/21502332/generating-random-dates)
#Random dates between 2017 and 2019
x <- data.frame(
  ID = c(1:1000),
  Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by="day"), 1000, replace = T),
  Gender = ifelse(rbinom(1000,1,0.5),'Male','Female') #Random male/female, 50% probability
)
#Make departure one or two days after arrival
x$Departure = x$Arrival + sample(1:2,1000, replace=T)
#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html
#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)
#Then for every person / row
#We want to know if their interval overlaps with the rest
#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206
#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()
#Check timing
t <- proc.time()
#Go through every row manually (sigh!)
for(i in 1:nrow(x)){
  print(paste0("Row ",i))
  #exclude self (don't want to check date overlap with myself)
  overlapcheck <- x[x$ID != x$ID[i],]
  #Find out what dates this person overlaps with - can do all other intervals in one command
  overlapcheck$overlaps <- int_overlaps(x$interval[i], overlapcheck$interval)
  #Eyeball check that is finding the overlaps we want
  #Is this ID date overlapping? Tick
  #View(overlapcheck[overlapcheck$overlaps,])
  #Use dplyr to find out the number of overlaps for male and female
  #Keep only rows where the overlap is TRUE
  #Also drop the interval column first tho as dplyr doesn't like it... (not tidy!)
  gendercount <- overlapcheck %>%
    select(-interval) %>%
    filter(overlaps) %>%
    group_by(Gender) %>%
    summarise(count = n()) %>% #Get count of observations for each overlap for each sex
    complete(Gender, fill = list(count = 0)) #Need this to keep zero counts: summarise drops them otherwise
  #We want count for each gender in their own column, so make wide
  gendercount <- gendercount %>%
    spread(key = Gender, value = count)
  #Store for turning into dataframe shortly
  gendercounts[[length(gendercounts)+1]] <- gendercount
}
#dplyr command: turn list into dataframe
gendercounts <- bind_rows(gendercounts)
#End result. Drop interval column, order columns
final <- cbind(x, gendercounts) %>%
  select(ID, Arrival, Departure, Gender, Male, Female)
#~24 seconds per thousand
proc.time()-t
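For completeness, here is a sketch of one vectorised candidate using a data.table non-equi self-join (the same overlap test as the foverlaps answer above). This is untested at the 50K scale, so treat it as a starting point rather than a drop-in replacement; xt, counts and final2 are just illustrative names:
library(data.table)
#Drop the interval column, which data.table doesn't need here
xt <- as.data.table(x[c("ID", "Arrival", "Departure", "Gender")])
#For each row (by = .EACHI), count the overlapping rows of the other copy by gender,
#excluding the person themselves; two stays overlap when each starts before the other ends
counts <- xt[xt,
             .(Male   = sum(Gender == "Male"   & ID != i.ID),
               Female = sum(Gender == "Female" & ID != i.ID)),
             on = .(Arrival <= Departure, Departure >= Arrival),
             by = .EACHI]
final2 <- cbind(x, counts[, .(Male, Female)])
The counts should match the loop's final output row for row, since the ID != i.ID condition removes self-matches the same way the loop's exclusion does.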
I have measured temperature at a high time resolution of 10 minutes on different urban tree species whose reactions should be compared. I am therefore particularly researching periods of heat. The task I am failing to do on my dataset is to select complete days based on a maximum value, e.g. days with at least one measurement above 30 °C should be subsetted from my data frame completely.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether the individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not to produce a new data frame. Whenever the value is above 30 °C at any time of a day, I want to include that whole date, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35, 1000, replace = TRUE),
  Variable1 = sample(1:200, 1000, replace = TRUE),
  Variable2 = sample(300:800, 1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous to the structure of my data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame in which all days with at least one entry marked as "heat" are included?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take all of the day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable specifying the day (dropping hours, minutes, etc.). Iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
  rr <- Measurings %>% filter(Time2 == x) # select date x
  # keep this day's subset only if heat30 contains "heat" at least once
  if(any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
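For what it's worth, the same day-level selection can also be written more compactly with a grouped filter; this is an equivalent formulation using the Time2 column created above, not a change in logic:
newdf2 <- Measurings %>%
  group_by(Time2) %>%
  filter(any(heat30 == "heat")) %>%
  ungroup()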
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
  filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 observations with temperatures varying between 20 and 35 across roughly 40 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
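For example (the seed value is arbitrary), put this line before the first sample() call:
set.seed(123) # any fixed value makes the sample() draws reproducible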
Here is my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames with different fake years of 365 days each. This means that each of the 10,000 data.frames needs to have a different start and end of year.
In total df has 14,965 days which, divided by 365 days, gives 41 years. In other words, df needs to be grouped 10,000 times differently into 41 years (of 365 days each).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc., and the remaining dates at the end of df need to be recycled back to the start.
The grouped fake years need to appear in a third column of the data.frames.
I would put all the data.frames into a list, but I don't know how to write the function that generates 10,000 different year start dates and then groups each data.frame into 41 windows of 365 days.
Can anyone help me?
@gringer gave a good answer, but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and then add the remaining df rows for each column while maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of integers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this and produce Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream use; the alternative would be a double seq for each result column, with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, the result columns are subset from the original data frame (where the arithmetic is already done), starting with one date in the first half of the frame and choosing the next 14,965 values. I'm using nrow(df)/2 instead for more generic code:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-calculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
I am a beginner in R, and I would like to do a survival analysis on a dataset about light bulbs that I have. I would like to calculate the lifetime of a light bulb, so I need to calculate the time period between, for example, date_broken in row 2 and date_fixed in row 1.
I know I can use difftime(time, time2, units = "days") to calculate the time between date_fixed and date_broken in the same row, but then I would calculate the time the light bulb was broken and that is not what I am interested in.
I provided a small sample of my data below. For each light bulb on a particular location I have information about the date it broke and the day it was fixed.
(Besides the columns given in the example below, I have other features that should have predictive value.)
# date_broken date_fixed lightbulb location
# 1 26-2-2015 17-3-2015 1 A
# 2 19-3-2015 26-3-2015 1 A
# 3 26-3-2015 26-3-2015 1 A
# 4 17-4-2015 29-4-2015 2 B
# 5 19-6-2015 25-6-2015 2 B
# 6 9-7-2015 30-7-2015 2 B
ds <- data.frame(date_broken = c("26-2-2015", "19-3-2015", "26-3-2015",
                                 "17-4-2015", "19-6-2015", "9-7-2015"),
                 date_fixed = c("17-3-2015", "26-3-2015", "26-3-2015",
                                "29-4-2015", "25-6-2015", "30-7-2015"),
                 lightbulb = c("1", "1", "1", "2", "2", "2"),
                 location = c("A", "A", "A", "B", "B", "B"))
First you'll need to fix your dates, as @Gaurav suggested. Then, you'll need to summarize by lightbulb, or the difference will be meaningless.
I present here an alternative using packages lubridate and data.table:
library(lubridate)
library(data.table)
ds$date_broken <- dmy(ds$date_broken)
ds$date_fixed <- dmy(ds$date_fixed)
# lifetime: current break date minus the previous fix date, per lightbulb
setDT(ds)[, dt := difftime(date_broken, shift(date_fixed, 1L, type = "lag"), units = "days"),
          by = lightbulb]
ds
Which produces:
##    date_broken date_fixed lightbulb location      dt
## 1:  2015-02-26 2015-03-17         1        A NA days
## 2:  2015-03-19 2015-03-26         1        A  2 days
## 3:  2015-03-26 2015-03-26         1        A  0 days
## 4:  2015-04-17 2015-04-29         2        B NA days
## 5:  2015-06-19 2015-06-25         2        B 51 days
## 6:  2015-07-09 2015-07-30         2        B 14 days
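The NA rows are expected: shift(date_fixed, 1L, type = "lag") has no previous date_fixed for the first observation of each lightbulb, so there is nothing to subtract there.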
For future questions, it helps a lot if you include the expected results along with your question.
This should help
library(dplyr)
ds2 <- ds %>%
  group_by(lightbulb) %>%
  mutate(tp = as.Date(date_broken, "%d-%m-%Y") -
              as.Date(lag(date_fixed, 1), "%d-%m-%Y"))
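On the sample data, tp should come out as NA, 2, 0, NA, 51 and 14 days; as in the data.table answer above, the first row of each lightbulb has no previous date_fixed.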
I really love those super-impressive pipe operators in R. They're so elegant, and great if someone's got a ready-to-go solution.
I mostly do loops, probably 'cos I like seeing what's going on, and I can debug as I go. (I was also brought up on BASIC some decades ago - but don't tell anyone.)
Anyway this was my approach for something very similar that I was doing, with hopefully a bit of added value using a sequence counter. This might be useful as a regression variable (covariate or stratification) or something by which you can subset, assuming you might for example want to look at later failures separately from earlier failures. Survival curves by sequence counter were quite informative in my work.
First convert the dates...
## convert dates. once done it's done
ds$date_broken <- as.Date(ds$date_broken, "%d-%m-%Y")
ds$date_fixed <- as.Date(ds$date_fixed, "%d-%m-%Y")
Add a sequence counter column (called seq) to keep track of number of failures
ds$seq <- 1
Populate that counter column
for (rdx in 2:nrow(ds)) {
  ## if same item, increment count. If new item, the count stays at seq = 1 from initialisation
  ifelse(ds$lightbulb[rdx] == ds$lightbulb[rdx-1], ds$seq[rdx] <- ds$seq[rdx-1]+1, 1)
}
Now add a difference column
ds$diff <- NA
Populate that difference column
for (rdx in 2:nrow(ds)) {
  ## if same item, difference is current failure date minus previous in-service date
  ifelse(ds$seq[rdx] != 1, ds$diff[rdx] <- ds$date_broken[rdx] - ds$date_fixed[rdx-1], NA)
}
Well that worked for me, both to conceptualise and to implement. Please note that some folk do get a bit hung up with loops (http://paleocave.sciencesortof.com/2013/03/writing-a-for-loop-in-r/), but around my neck of the woods life is for living, not coding, and even I sometimes stir paint with a screwdriver (an old one though).