I have a panel with 50 subjects and one observation per subject per year, 2007-2020. I need to construct a multiple-event survival model, where events can happen more than once per subject.
In terms of variables, I have a dummy for the event, a variable indicating the year, a time_overall variable showing the total time that has passed (it ranges from 1 to 14, increasing by one as each year passes, irrespective of the event), and a time_conditional variable showing the time between the last two events.
I want to create start and stop time variables based on time_overall, the event dummy, and time_conditional. I wrote:
for (row in c(1:nrow(df))) {
  if (df$time_overall[row] == 1 & df$event[row] == 1) {
    df$time[row] <- 1
    df$time_start[row] <- 0
    df$time_stop[row] <- 1
  }
  if (df$time_overall[row] > 1 & df$event == 1) {
    df$time_start[row] <- df$time_overall[row] - df$time_conditional[row]
    df$time_stop[row] <- df$time_overall[row]
    df$time[row] <- df$time_stop[row] - df$time_start[row]
  }
}
but most of the time_start values change to zero, leading to hundreds of observations being dropped when I construct the model.
Any idea why this might be, or an alternative approach that can create start and stop time variables based on the information I've presented?
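One likely culprit (hard to confirm without the data): the second if condition tests df$event == 1 against the whole column instead of df$event[row], so if() is driven by the first element of that comparison alone and the branch fires (or not) for the wrong rows, overwriting time_start. The loop can also be avoided entirely. A minimal vectorized sketch, assuming time_conditional holds the gap since the previous event on event rows:
ev <- df$event == 1
df$time_start <- NA_real_
df$time_stop <- NA_real_
df$time <- NA_real_
# First-period events start at 0; later events start where the last gap began
df$time_start[ev] <- ifelse(df$time_overall[ev] == 1,
                            0,
                            df$time_overall[ev] - df$time_conditional[ev])
df$time_stop[ev] <- df$time_overall[ev]
df$time[ev] <- df$time_stop[ev] - df$time_start[ev]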
I am simulating plant growth in R.
Vector A is a 10x1 vector that holds values for plant mass. Each row represents an hour: at hour zero, the first row is set to 5 while the remaining 9 rows are zeros, and at each hourly interval a new value is added to the next available row.
A=matrix(nrow=10,ncol=1,0)
A[1]=5 #setting the plant's mass at time zero
At each time interval, I want to use the plant's mass at the start of the interval as a constant and then re-calculate the new mass at the end of the interval.
I have created a function that uses plant mass from Vector A as a constant as follows:
StartRespire = function(A, Respira = 66.7, Cgain) { # note: Cgain is a constant determined in another loop
  RespCost = A * Respira
  if (RespCost > Cgain) {
    NewA = 0
    print("Not enough C for respiration. Plant dies")
  } else {
    NewA = A * Cgain
  }
  return(list(NewA = NewA))
}
My questions are:
Is it possible to make the function automatically choose the last non-zero value in Vector A?
Is there a way to make the function define NewA as the next available value (i.e., add it to the next row)?
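Not an answer from the thread, just a minimal sketch of both ideas, assuming A is the 10x1 matrix above: which(A != 0) locates the non-zero rows, tail(..., 1) picks the last one, and the row after it is the next available slot. (The sketch does not guard against running off the end of A.)
step = function(A, Respira = 66.7, Cgain) {
  last = tail(which(A != 0), 1) # index of the last non-zero mass
  RespCost = A[last] * Respira
  if (RespCost > Cgain) {
    A[last + 1] = 0 # plant dies; next row stays at zero mass
    print("Not enough C for respiration. Plant dies")
  } else {
    A[last + 1] = A[last] * Cgain # write the new mass to the next available row
  }
  A
}
A = matrix(nrow = 10, ncol = 1, 0)
A[1] = 5
A = step(A, Cgain = 400) # Cgain = 400 is a made-up value, for illustration only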
I am having some trouble coming up with a solution that properly classifies a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is that I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check whether, out of 4 neighboring observations in a given column of my data frame, at least 3 of them contain the same value).

I first tried creating another indicator variable recording whether the condition is satisfied (1 or 0 = yes or no). Then I set up a series of ifelse() statements within a loop to assign the proper categorization to the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the data frame after running the loop, only the observation where the condition is satisfied receives the value, not its neighbors. Here is my code:
# sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NA

for (i in 1:nrow(sample_dat)) {
  sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i] == 1 &
                                    ((sample_dat$initial_ind[i-2] == 1 |
                                        sample_dat$initial_ind[i-1] == 1) &
                                       (sample_dat$initial_ind[i+2] == 1 |
                                          sample_dat$initial_ind[i+1] == 1)),
                                  "trending",
                                  "non-trending")
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3 out of 4 observations in that group have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this, but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function over a rolling window of your data. The question then becomes one of creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks whether the condition is true for at least 3 out of 5 of the observation plus its four closest neighbors. In that case, just adding the 1s up and checking whether the sum is above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
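With the sample data this marks observations 6, 8, and 9 as "trending" (each sits in a centered five-wide window containing three 1s), while the first two and last two entries become NA from the fill padding.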
Edit: if you want a function that checks whether the observation and the next 3 observations contain at least three 1s, you can do something very similar just by changing the align argument of rollapply:
trend_test_2 = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")
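To get the "3/4, 5/6, 2/5" generality asked for, the threshold and the window width can both be parameters; the closure below is my own naming, not from the original answer:
library(zoo)
# k = minimum number of 1s required; the window width is set in rollapply
trend_test_k = function(k) {
  function(x) ifelse(sum(x) >= k, "trending", "non-trending")
}
# e.g. flag windows where at least 3 of 4 consecutive observations are 1
sample_dat$violate_3_of_4 = rollapply(sample_dat$initial_ind, FUN = trend_test_k(3),
                                      width = 4, fill = NA, align = "left")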
So, I have a time series with a number of observations per ID. I measure a 1/0 variable at each time point. I want to know how many times a person switches from 1 to 0 or 0 to 1.
There are missing data at random in each ID.
id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
indicator=c(1,0,0,0,1,1,NA,1,1,0,1,0,NA,1,1,0)
timepoint=c(2003,2004,2005,2006)
td = data.frame(id,timepoint,indicator)
I need code to return the number of switches per person in these data.
To count the number of switches from 0 to 1 or 1 to 0, all you need to do is shift your vector and compare the two shifted versions. Then apply this function by id, assuming your data are already sorted in chronological order.
library(data.table)
count.switch=function(x) sum(tail(x,-1)!= head(x,-1),na.rm=T)
setDT(td) # convert to data.table so the [, j, keyby] syntax works
td[,count.switch(indicator),keyby=.(id)]
With this method you can specify which switch you would like to count; for example, switches from 1 to 0 only:
count.switch.10=function(x) sum(tail(x,-1) == 0 & head(x,-1) == 1, na.rm=T)
You could also count both directions with a (tail == 0 & head == 1) | (tail == 1 & head == 0) condition.
Another trick is to count when the two shifted vectors add up to 1 element-wise, which only happens when one element is 0 and the other is 1.
count.switch=function(x) sum(tail(x,-1) + head(x,-1) == 1,na.rm=T)
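For reference, the same idea written with data.table's own shift() (my restatement, not from the original answer): on the sample data both versions return 1, 0, 3 and 1 switches for ids 1 through 4, and a switch that straddles an NA is not counted because na.rm=T drops those comparisons.
td[, .(switches = sum(indicator != shift(indicator), na.rm = TRUE)), keyby = .(id)]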
I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date, and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building an ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want the last 3 days but only 2 records exist, this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id=c(rep(1,5),rep(2,10))
,day=c(1:5,1:10)
,device_repaired=sample(0:1,15,replace=TRUE)
,device_replaced=sample(0:1,15,replace=TRUE))
# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3, 1, data, "device_repaired", 2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  # Subset dataset
  df = subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make calculation
  if (nrow(df) < fpreviousdays) {
    calculation = NA
  } else {
    calculation = sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of available attributes (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown very quickly, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these extra arguments to an lapply-style function.
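Not from the thread, but a minimal sketch of one way to do it: mapply iterates over the per-row arguments in step, fixed arguments go in MoreArgs, and parallel::mcmapply takes the same call with an mc.cores argument (fork-based, so not available on Windows):
library(parallel)
# Compute one feature for every row: repairs in the last 2 days
# (the column name repaired_last_2d is made up for illustration)
data$repaired_last_2d <- mapply(getCalculation,
                                fday = data$day,
                                fdeviceid = data$device_id,
                                MoreArgs = list(fdata = data,
                                                fattribute = "device_repaired",
                                                fpreviousdays = 2))
# Parallel variant: the same call through mcmapply
data$repaired_last_2d <- mcmapply(getCalculation,
                                  fday = data$day,
                                  fdeviceid = data$device_id,
                                  MoreArgs = list(fdata = data,
                                                  fattribute = "device_repaired",
                                                  fpreviousdays = 2),
                                  mc.cores = 4)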
As an intermediate R user, I know that for loops can very often be optimized by using functions like apply or otherwise. However, I am not aware of functions that can optimize my current code for generating a Markov chain matrix, which runs quite slowly. Have I maxed out on speed, or are there things that I am overlooking?

I am trying to find the transition matrix for a Markov chain by counting the number of occurrences in 24-hour time periods before given alerts. The vector ids contains all possible ids (about 1700).
The original matrix looks like this, as an example:
> matrix
  id       time
   1 1376084071
   1 1376084937
   1 1376023439
   2 1376084320
   2 1372983476
   3 1374789234
   3 1370234809
And here is my code to try to handle this:
matrixtimesort <- matrix[order(-matrix$time),]
frequency = 86400 # number of seconds in 1 day

# Initialize matrix that will contain probabilities
transprobs <- matrix(data = 0, nrow = length(ids), ncol = length(ids))

# Loop through each type of event
for (i in 1:length(ids)) {
  localmatrix <- matrix[matrix$id == ids[i],]
  # Loop through each row of the event
  for (j in 1:nrow(localmatrix)) {
    localtime <- localmatrix[j,]$time
    # Find top and bottom row numbers defining the 1-day window
    indices <- which(matrixtimesort$time < localtime & matrixtimesort$time >= (localtime - frequency))
    # Find IDs that occur within the 1-day window
    positiveids <- unique(matrixtimesort[min(indices):max(indices),]$id)
    # Add one to each cell in the matrix that corresponds to the occurrence of an event
    for (l in 1:length(positiveids)) {
      k <- which(ids == positiveids[l])
      transprobs[i,k] <- transprobs[i,k] + 1
    }
  }
  # Divide each row by total number of occurrences to determine probabilities
  transprobs[i,] <- transprobs[i,]/nrow(localmatrix)
}
# Normalize rows so that row sums are equal to 1
normalized <- transprobs/rowSums(transprobs)
Can anyone make any suggestions to optimize this for speed?
Using nested loops seems like a bad idea; your code can be vectorized to speed it up.
For example, why find the top and bottom row numbers at all? You can simply compare the time values with "time_0 + frequency": it is a vectorized operation.
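A minimal sketch of what that vectorized comparison could look like, reusing the asker's matrixtimesort, ids, frequency, localtime and transprobs (the bookkeeping details are mine):
# Flag every row inside the 1-day window before localtime in one shot
in_window <- matrixtimesort$time < localtime &
  matrixtimesort$time >= (localtime - frequency)
positiveids <- unique(matrixtimesort$id[in_window])
# The innermost loop can also go: match() finds all column indices at once
k <- match(positiveids, ids)
transprobs[i, k] <- transprobs[i, k] + 1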
HTH.