How to conditionally compute difference in column values between rows in R?

I have the following data set (this is just a sample; the actual data set has many more rows):
User  Time      Flag  TimeDifference             Expected o/p (Seconds)
A     11:39:30  1
A     11:37:53  1
A     20:44:19  1
A     22:58:42  2     Calculate time difference  8063
A     23:01:54  1     Calculate time difference  192
B     23:03:00  1
B     23:03:33  1
B     23:03:53  1
B     15:00:42  3     Calculate time difference  28991
B     19:35:31  2     Calculate time difference  16489
B     19:35:34  1     Calculate time difference  3
C     10:19:06  1
C     10:59:50  1
C     10:59:50  1
C     12:16:36  1
C     12:16:36  1
For each user, I need to calculate the time difference (in seconds) between rows whenever there is a 'Flag change' and store it in a new column called 'TimeDifference'.
That is, whenever the flag changes from 1 to 2, 2 to 3, 2 to 1, or 3 to 1, I need to compute the difference in the Time column between the current row and the preceding row.
The time is in hh:mm:ss format.
Is there a for-loop or a function I can apply here?
Help appreciated.

One way to do that is to turn your time variable into a POSIXlt time object and calculate the time difference (for all rows) against a shifted copy of the time variable. Then use your flag variable to set the differences you don't want to NA. The important part is that you need to diff the flag variable so you know when the flag has changed.
I'm laying out all the steps here, but there's probably a quicker way to do it:
# Create the data
flag <- c(1,1,1,2,1,1,1,1,3,2,1,1,1,1,1,1)
time <- c('11:39:30','11:37:53','20:44:19','22:58:42','23:01:54',
'23:03:00','23:03:33','23:03:53','15:00:42','19:35:31',
'19:35:34','10:19:06','10:59:50','10:59:50','12:16:36',
'12:16:36')
# Shift the time down by one row (drop the last element, pad with NA at the front)
time_shift <- c(NA, head(time, -1))
# Turn into POSIXlt objects
time <- strptime(time, format='%H:%M:%S')
time_shift <- strptime(time_shift, format='%H:%M:%S')
data <- data.frame(time, time_shift, flag)
# Calculate diffs
data$time_diff <- as.numeric(abs(difftime(data$time, data$time_shift, units=('secs'))))
data$flag_diff <- c(NA,abs(diff(data$flag)))
# Set non 'flag change' diffs to NA
data$time_diff[data$flag_diff == 0] <- NA
You'll probably want to remove the useless columns and convert time back into your original representation, which you can do with:
data$time <- format(data$time, "%H:%M:%S")
data <- data[c('time', 'flag', 'time_diff')]
That will result in a dataframe that looks like this:
time flag time_diff
1 11:39:30 1 NA
2 11:37:53 1 NA
3 20:44:19 1 NA
4 22:58:42 2 8063
5 23:01:54 1 192
6 23:03:00 1 NA
7 23:03:33 1 NA
8 23:03:53 1 NA
9 15:00:42 3 28991
10 19:35:31 2 16489
11 19:35:34 1 3
12 10:19:06 1 NA
13 10:59:50 1 NA
14 10:59:50 1 NA
15 12:16:36 1 NA
16 12:16:36 1 NA
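Note that the steps above do not use the User column, so a flag change that happens to fall on the first row of a new user would be compared against the previous user's last row. A minimal base R sketch of a per-user variant follows; the helper names to_secs and lag_diff are made up for illustration, and the times are assumed to be on a single day so they can be treated as seconds since midnight:

```r
# Sample data from the question
df <- data.frame(
  User = rep(c("A", "B", "C"), times = c(5, 6, 5)),
  Time = c("11:39:30", "11:37:53", "20:44:19", "22:58:42", "23:01:54",
           "23:03:00", "23:03:33", "23:03:53", "15:00:42", "19:35:31",
           "19:35:34", "10:19:06", "10:59:50", "10:59:50", "12:16:36",
           "12:16:36"),
  Flag = c(1, 1, 1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1)
)

# Hypothetical helper: convert "hh:mm:ss" strings to seconds since midnight
to_secs <- function(x) {
  parts <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  storage.mode(parts) <- "numeric"
  as.vector(parts %*% c(3600, 60, 1))
}

# Hypothetical helper: absolute difference to the previous element, NA for the first
lag_diff <- function(x) c(NA, abs(diff(x)))

secs <- to_secs(df$Time)

# ave() applies the function within each User, so differences never cross users
df$time_diff <- ave(secs, df$User, FUN = lag_diff)

# TRUE where Flag differs from the previous row of the same user
flag_changed <- ave(df$Flag, df$User, FUN = function(f) c(0, diff(f))) != 0
df$time_diff[!flag_changed] <- NA
df
```

On the sample data this gives the same non-NA values as the table above (rows 4, 5, 9, 10 and 11).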

Some preprocessing may be required first, so that Time is a date-time object rather than a character string:
# Parse the hh:mm:ss strings; today's date is attached, but it cancels out in the differences
df$Time <- as.POSIXct(df$Time, format = "%H:%M:%S")
sol<-function(d){
Time_difference<-numeric(nrow(d))
ind<-which(diff(d$Flag)!=0)+1
#calculate differences in time where change in Flag was detected
Time_difference[ind]<-abs(difftime(time1 = d$Time[ind],time2 =
d$Time[(ind-1)], units = "secs"))
d$Time_Difference<-Time_difference
return(d)
}
Now use the ddply function from the plyr package, which follows the split-apply-combine principle. It takes a data frame (df), splits it by a variable ("User" in this case), applies a function (sol in this case) to each subset, and then recombines the results into a single data frame.
library(plyr)
ddply(.data = df, .variables = "User", .fun = sol)
# User Time Flag Time_Difference
#1 A 11:39:30 1 0
#2 A 11:37:53 1 0
#3 A 20:44:19 1 0
#4 A 22:58:42 2 8063
#5 A 23:01:54 1 192
#6 B 23:03:00 1 0
#7 B 23:03:33 1 0
#8 B 23:03:53 1 0
#9 B 15:00:42 3 28991
#10 B 19:35:31 2 16489
#11 B 19:35:34 1 3
#12 C 10:19:06 1 0
#13 C 10:59:50 1 0
#14 C 10:59:50 1 0
#15 C 12:16:36 1 0
#16 C 12:16:36 1 0
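If you would rather not depend on plyr, the same split-apply-combine idea can be sketched in base R with split(), lapply() and do.call(rbind, ...). The toy data frame below and its Secs column (time already expressed in seconds) are assumptions made for illustration, as is the function name flag_change_diff:

```r
# Toy data (hypothetical): Secs is the time already converted to seconds
df <- data.frame(
  User = c("A", "A", "B", "B", "B"),
  Flag = c(1, 2, 1, 1, 3),
  Secs = c(10, 25, 5, 9, 40)
)

# Same logic as sol() above, applied to one user's rows at a time
flag_change_diff <- function(d) {
  out <- numeric(nrow(d))
  ind <- which(diff(d$Flag) != 0) + 1          # rows where Flag changed
  out[ind] <- abs(d$Secs[ind] - d$Secs[ind - 1])
  d$Time_Difference <- out
  d
}

pieces <- split(df, df$User)                    # split by user
pieces <- lapply(pieces, flag_change_diff)      # apply to each piece
result <- do.call(rbind, pieces)                # combine back into one data frame
rownames(result) <- NULL
result
```

As with sol(), rows without a flag change get 0 rather than NA.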

Related

Filling (NA values) in the column based on its previous records and another column (with interval) in R

I want to fill the action column based on its own records and the time column. An NA in the action column should be filled based on the previous action record and a time interval. Let's say we set the time interval to 10: if action is A at time 1, all following NAs in action should become A until time==11 (1+10).
Please note that if action or id changes, this process should be reset. For example, in row 4 we have B with time==11, so I want to fill the next NAs with B until time==21; but we have C at time==16, so from there we continue filling NAs with C until time==26.
df<-read.table(text="
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6",header=T,stringsAsFactors = F)
Desired Result:
id action time
1 A 1
1 A 4
1 A 9
1 B 11
1 B 12
1 C 16
1 C 19
1 NA 30
1 A 31
1 A 32
2 NA 1
2 A 2
2 A 6
We can extract the non-NA rows to use as a reference for filling in values, then iterate through the data set and conditionally replace values based on if they meet the requirements of id and the time interval.
# Use row numbers as an index (unique Id)
df$idx <- 1:nrow(df)
# Find the non-NA rows to use as a reference for imputation
library(dplyr)
idx <- df %>%
group_by(id) %>%
filter(!is.na(action))
The temporary data set idx is used as the reference and the column idx is our unique identifier. Let's first look at the logic for finding and filling in the missing values without worrying about the time interval, so that it's easier to read and understand:
# Ignoring the 'interval' limitation, we'd fill them in like this:
for(r in 1:nrow(df)){
if(is.na(df$action[r])){
df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
}
}
If you're running this example code, make sure you re-create df and idx before proceeding, since df is modified by that last code block.
The time interval requires us to do a logical test on the value of time and also another test to avoid trying to conduct the time comparison on NA values:
# Accounting for the max interval:
interval <- 10
for(r in 1:nrow(df)){
if(is.na(df$action[r])){
if(!is.na(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]))){
if(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]) + interval >= df$time[r])
df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
}
}
}
df
This gives us:
id action time idx
1 1 A 1 1
2 1 A 4 2
3 1 A 9 3
4 1 B 11 4
5 1 B 12 5
6 1 C 16 6
7 1 C 19 7
8 1 <NA> 30 8
9 1 A 31 9
10 1 A 32 10
11 2 <NA> 1 11
12 2 A 2 12
13 2 A 6 13
which matches your desired output.
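For larger data, the row-by-row loop can be avoided. Here is a rough base R sketch of the same rule, assuming rows are already ordered by time within each id; the function name fill_action is made up for illustration. It uses cummax() to find, for every row, the position of the most recent non-NA action, and then keeps the fill only if that reference row is within the interval:

```r
df <- read.table(text = "
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6", header = TRUE, stringsAsFactors = FALSE)

interval <- 10

# Hypothetical helper: fill one id's actions, respecting the time interval
fill_action <- function(action, time, interval) {
  # Position of the most recent non-NA action at each row (0 if none seen yet)
  last_idx <- cummax(ifelse(is.na(action), 0L, seq_along(action)))
  has_ref <- last_idx > 0
  ref <- last_idx[has_ref]
  out <- rep(NA_character_, length(action))
  # Keep the carried-forward action only while it is within the interval
  within_interval <- time[has_ref] - time[ref] <= interval
  out[has_ref] <- ifelse(within_interval, action[ref], NA_character_)
  out
}

# Apply per id so that nothing is carried across ids
for (g in unique(df$id)) {
  rows <- df$id == g
  df$action[rows] <- fill_action(df$action[rows], df$time[rows], interval)
}
df
```

Rows that already have an action are their own reference (time difference 0), so they are left unchanged.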

Subset dataframe based on non-sequential dates

I have data that looks like this
df<-data.frame(datecol=as.Date(c("2010-04-03","2010-04-04","2010-04-05","2010-04-06","2010-04-07",
"2010-04-03","2010-04-04","2010-04-05","2010-04-06","2010-04-07",
"2010-05-06","2010-05-07","2010-05-09","2010-06-06","2010-06-07")),x=c(1,1,1,0,1,1,1,0,0,0,1,0,0,0,1),type=c(rep("A",5),rep("B",5),rep("C",5)))
> df
datecol x type
1 2010-04-03 1 A
2 2010-04-04 1 A
3 2010-04-05 1 A
4 2010-04-06 0 A
5 2010-04-07 1 A
6 2010-04-03 1 B
7 2010-04-04 1 B
8 2010-04-05 0 B
9 2010-04-06 0 B
10 2010-04-07 0 B
11 2010-05-06 1 C
12 2010-05-07 0 C
13 2010-05-09 0 C
14 2010-06-06 0 C
15 2010-06-07 1 C
I need to subset this dataframe by type, where I only keep the "types" which have 2 or more different dates (among the rows where x is 1) and those dates are more than 1 day apart. In the above example type A has 4 different dates, and type C has 2 different dates which are more than 1 day apart, so I want to save these two as a new dataframe. Type B has 2 different dates, but they are only 1 day apart, so I don't want to keep it.
I was thinking of doing it in a loop: count how many unique dates there are within each type and keep everything which has more than 2 different dates. Then I would look at the types which have only 2 different dates, calculate the distance between them, and keep only the ones where the distance is more than 1. But it seems that there should be a more efficient way. Any ideas?
One solution with data.table:
#make sure datecol is Date
df$datecol <- as.Date(df$datecol)
library(data.table)
#x needs to be 1 and the date difference more than a day per type
#then in the second [] we select the TRUEs
setDT(df)[x == 1, diff(datecol) > 1, by = type][V1 == TRUE, type]
#[1] A C
#Levels: A B C
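The data.table one-liner returns one row per date gap, so a type with several gaps larger than a day would be listed more than once. A base R sketch that returns one TRUE/FALSE per type and then subsets the original rows might look like this (has_gap and kept are names chosen for illustration, and the x == 1 filter follows the comment in the answer above):

```r
df <- data.frame(
  datecol = as.Date(c("2010-04-03", "2010-04-04", "2010-04-05", "2010-04-06", "2010-04-07",
                      "2010-04-03", "2010-04-04", "2010-04-05", "2010-04-06", "2010-04-07",
                      "2010-05-06", "2010-05-07", "2010-05-09", "2010-06-06", "2010-06-07")),
  x = c(1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1),
  type = rep(c("A", "B", "C"), each = 5)
)

# One logical per type: at least 2 distinct dates with x == 1,
# and at least one gap between consecutive dates of more than 1 day
has_gap <- vapply(split(df, df$type), function(g) {
  d <- sort(unique(g$datecol[g$x == 1]))
  length(d) >= 2 && any(diff(d) > 1)
}, logical(1))

# Keep all rows belonging to the qualifying types
kept <- df[df$type %in% names(has_gap)[has_gap], ]
kept
```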

Perform operations on a data frame based on a factor

I'm having a hard time describing this (as can probably be seen from the poor question title), so it's best explained with an example.
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I inserted the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate, but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
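One caveat: indexing total with the factor df$run uses the factor's integer codes as row numbers, so it relies on the rows of total being in the same order as the levels of run. A small sketch using match() on the run values themselves, which should not depend on row order (total is deliberately built out of order here to illustrate that):

```r
df <- data.frame(run   = factor(c(rep(1, 3), rep(2, 3))),
                 group = factor(rep(c("a", "b", "c"), 2)),
                 sum   = c(1, 8, 34, 2, 7, 33))

# Lookup table with rows deliberately in a different order than the levels of run
total <- data.frame(run = factor(c(2, 1)), total = c(47, 45))

# Match on the run labels rather than on factor codes
pos <- match(as.character(df$run), as.character(total$run))
df$percent <- df$sum / total$total[pos]
df
```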
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate:
library(dplyr)
df2 <- df2 %>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Quantitative Time-series data, duration of states

I have a longitudinal dataset with a time variable and a qualitative variable. My subject can be in one of three states, sometimes the state changes, sometimes it stays the same.
What I would like to produce is a new dataframe which gives me, for every time a subject is in a state, at what time it first was in that state and how long the subject stayed in that same state. I want to do this because my end goal is to see whether state-switches occur more/less often for different treatments, length of states differ per state, length of states changes over time, etcetera.
Example data:
set.seed(1)
Data=data.frame(time=1:100,State=sample(c('a','b','c'),100,replace=TRUE))
The first few lines of Data look like this
time State
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
6 6 c
7 7 c
I would like to produce this:
StartTime State Duration
1 1 a 1
2 2 b 2
3 4 c 1
4 5 a 1
5 6 c 2
I can probably achieve this with a while-loop but this seems highly inefficient, especially since my actual data is 700000 lines per subject. Is there a better way to do it? Maybe something with the diff-function and %in%. I can't figure it out.
set.seed(1)
Data=data.frame(time=1:100,State=sample(c('a','b','c'),100,replace=TRUE))
Use data.table with data of that size:
library(data.table)
setDT(Data)
head(Data)
# time State
#1: 1 a
#2: 2 b
#3: 3 b
#4: 4 c
#5: 5 a
#6: 6 c
Give each state run a number:
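Data[, state_run := cumsum(c(TRUE, diff(as.integer(factor(State))) != 0L))]
#factor() makes this work whether State is a factor or a character column
#(since R 4.0.0, data.frame() no longer converts strings to factors by default)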
Find the values of interest for each state run:
Data2 <- Data[, list(StartTime = min(time),
State = State[1],
Duration = diff(range(time)) + 1), by = state_run]
head(Data2)
# state_run StartTime State Duration
#1: 1 1 a 1
#2: 2 2 b 2
#3: 3 4 c 1
#4: 4 5 a 1
#5: 5 6 c 2
#6: 6 8 b 2
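For comparison, base R's rle() (run-length encoding) gets at the same run structure without any packages. A minimal sketch on the first seven rows shown in the question, assuming the rows are ordered by time:

```r
# First seven rows from the question
time  <- 1:7
state <- c("a", "b", "b", "c", "a", "c", "c")

runs <- rle(state)                        # lengths and values of consecutive runs
end_idx   <- cumsum(runs$lengths)         # last row of each run
start_idx <- end_idx - runs$lengths + 1   # first row of each run

result <- data.frame(
  StartTime = time[start_idx],
  State     = runs$values,
  Duration  = time[end_idx] - time[start_idx] + 1
)
result
```

Note that rle() needs a plain vector, so a factor column would have to be wrapped in as.character() first.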

Replace values in a series exceeding a threshold

In a dataframe I'd like to replace values in a series where they exceed a given threshold.
For example, within a group ('ID') in a series ordered by 'time', once 'value' reaches (or exceeds) 3, I'd like to make all following entries equal 3 as well.
ID <- as.factor(c(rep("A", 3), rep("B",3), rep("C",3)))
time <- rep(1:3, 3)
value <- c(c(1,1,2), c(2,3,2), c(3,3,2))
dat <- cbind.data.frame(ID, time, value)
dat
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 2
C 1 3
C 2 3
C 3 2
I'd like it to be:
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 3
C 1 3
C 2 3
C 3 3
This should be easy, but I can't figure it out. Thanks!
The ave function makes this very easy by allowing you to apply a function to each of the groupings. In this case, we will adapt cummax (cumulative maximum) to see if we've seen a 3 yet.
dat$value2<-with(dat, ave(value, ID, FUN=
function(x) ifelse(cummax(x)>=3, 3, x)))
dat
# ID time value value2
# 1 A 1 1 1
# 2 A 2 1 1
# 3 A 3 2 2
# 4 B 1 2 2
# 5 B 2 3 3
# 6 B 3 2 3
# 7 C 1 3 3
# 8 C 2 3 3
# 9 C 3 2 3
You could also just use FUN=cummax if you want never-decreasing values. I wasn't sure about the sequence c(1,2,1) if you wanted to keep that unchanged or not.
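To make that difference concrete, here is a small sketch comparing the two on single vectors (cap_after_threshold is just a name chosen here for the anonymous function used above):

```r
# Same logic as the anonymous function passed to ave() above
cap_after_threshold <- function(x, threshold = 3) {
  ifelse(cummax(x) >= threshold, threshold, x)
}

below  <- c(1, 2, 1)   # never reaches the threshold
capped <- c(2, 3, 2)   # reaches the threshold at the second element

cap_after_threshold(below)   # unchanged: 1 2 1
cummax(below)                # never decreasing: 1 2 2
cap_after_threshold(capped)  # held at 3 once reached: 2 3 3
```

So the two only differ for series that dip without ever reaching the threshold.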
If you can assume your data are sorted by group, then this should be fast, essentially relying on findInterval() behind the scenes:
library(IRanges)
id <- Rle(ID)
three <- which(value>=3L)
ir <- reduce(IRanges(three, end(id)[findRun(three, id)]))
dat$value[as.integer(ir)] <- 3L
This avoids looping over the groups.
