I have a longitudinal dataset with a time variable and a qualitative state variable. A subject can be in one of three states; sometimes the state changes and sometimes it stays the same.
What I would like to produce is a new data frame that gives me, for every run of a subject in a state, the time at which it first entered that state and how long it stayed in that state. I want this because my end goal is to see whether state switches occur more or less often for different treatments, whether the length of a state differs per state, whether the length of states changes over time, et cetera.
Example data:
set.seed(1)
Data=data.frame(time=1:100,State=sample(c('a','b','c'),100,replace=TRUE))
The first few lines of Data look like this
time State
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
6 6 c
7 7 c
I would like to produce this:
StartTime State Duration
1 1 a 1
2 2 b 2
3 4 c 1
4 5 a 1
5 6 c 2
I can probably achieve this with a while loop, but that seems highly inefficient, especially since my actual data run to 700,000 lines per subject. Is there a better way to do it? Maybe something with diff() and %in%? I can't figure it out.
set.seed(1)
Data=data.frame(time=1:100,State=sample(c('a','b','c'),100,replace=TRUE))
Use data.table with data of that size:
library(data.table)
setDT(Data)
head(Data)
# time State
#1: 1 a
#2: 2 b
#3: 3 b
#4: 4 c
#5: 5 a
#6: 6 c
Give each state run a number:
Data[, state_run := cumsum(c(TRUE, diff(as.integer(State)) != 0L))]
# Note that this assumes State is a factor; if it is a character column
# (the default in R >= 4.0), convert with as.factor() first or see below.
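If State is a character column, rleid() from data.table (available since version 1.9.6) numbers the runs directly with no factor assumption; a minimal sketch:
Data[, state_run := rleid(State)]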
Find the values of interest for each state run:
Data2 <- Data[, list(StartTime = min(time),
                     State = State[1],
                     Duration = diff(range(time)) + 1), by = state_run]
head(Data2)
# state_run StartTime State Duration
#1: 1 1 a 1
#2: 2 2 b 2
#3: 3 4 c 1
#4: 4 5 a 1
#5: 5 6 c 2
#6: 6 8 b 2
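For what it's worth, the two steps can also be collapsed into a single call, again assuming rleid() is available. This sketch uses .N for the duration, which is equivalent here because time advances in steps of 1 within each run:
Data2 <- Data[, list(StartTime = time[1], Duration = .N),
              by = list(state_run = rleid(State), State)]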
Related
I have the following data set (this is just a sample; the actual data set runs to many more rows).
User  Time      Flag  TimeDifference             Expected o/p (seconds)
A     11:39:30  1
A     11:37:53  1
A     20:44:19  1
A     22:58:42  2     calculate time difference  8063
A     23:01:54  1     calculate time difference  192
B     23:03:00  1
B     23:03:33  1
B     23:03:53  1
B     15:00:42  3     calculate time difference  28991
B     19:35:31  2     calculate time difference  16489
B     19:35:34  1     calculate time difference  3
C     10:19:06  1
C     10:59:50  1
C     10:59:50  1
C     12:16:36  1
C     12:16:36  1
For each user, I need to calculate the time difference (in seconds) between rows whenever there is a flag change, and store it in a new column called 'TimeDifference'. That is, whenever the flag changes from 1 to 2, 2 to 3, 2 to 1, or 3 to 1, I need to compute the difference in the Time column between the current row and the preceding row.
Time is in hh:mm:ss format.
Is there a for-loop (or similar) approach I can apply here?
Help appreciated.
One way to do this is to turn your time variable into a POSIXlt time object and calculate the time difference (for all rows) against a shifted copy of the time variable, then use your flag variable to NA out the differences you don't want. The important part is that you need to diff the flag variable so you know when the flag has changed.
I'm laying out all the steps here, but there's probably a quicker way to do it:
# Create the data
flag <- c(1,1,1,2,1,1,1,1,3,2,1,1,1,1,1,1)
time <- c('11:39:30','11:37:53','20:44:19','22:58:42','23:01:54',
          '23:03:00','23:03:33','23:03:53','15:00:42','19:35:31',
          '19:35:34','10:19:06','10:59:50','10:59:50','12:16:36',
          '12:16:36')
# Shift the time
time_shift <- c(NA, head(time, -1))  # drop the last element, prepend NA
# Turn into POSIXlt objects
time <- strptime(time, format='%H:%M:%S')
time_shift <- strptime(time_shift, format='%H:%M:%S')
data <- data.frame(time, time_shift, flag)
# Calculate diffs
data$time_diff <- as.numeric(abs(difftime(data$time, data$time_shift, units = "secs")))
data$flag_diff <- c(NA,abs(diff(data$flag)))
# Set non 'flag change' diffs to NA
data$time_diff[data$flag_diff == 0] <- NA
You'll probably want to remove the useless columns and convert time back into your original representation, which you can do with:
data$time <- format(data$time, "%H:%M:%S")
data <- data[c('time', 'flag', 'time_diff')]
That will result in a dataframe that looks like this:
time flag time_diff
1 11:39:30 1 NA
2 11:37:53 1 NA
3 20:44:19 1 NA
4 22:58:42 2 8063
5 23:01:54 1 192
6 23:03:00 1 NA
7 23:03:33 1 NA
8 23:03:53 1 NA
9 15:00:42 3 28991
10 19:35:31 2 16489
11 19:35:34 1 3
12 10:19:06 1 NA
13 10:59:50 1 NA
14 10:59:50 1 NA
15 12:16:36 1 NA
16 12:16:36 1 NA
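One caveat: the shift above runs straight across user boundaries. With this sample it happens to be harmless (each user ends and the next begins with flag 1, so the flag diff is 0), but to be safe you can blank out the difference at the first row of each user. A minimal sketch, assuming a user vector matching the question's rows:
# Hypothetical user vector matching the question's grouping
user <- c(rep('A', 5), rep('B', 6), rep('C', 5))
# The first row of each user should never get a time difference
data$time_diff[!duplicated(user)] <- NA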
Some preprocessing may be required first, to turn Time into a date-time object:
# parse hh:mm:ss and convert to POSIXct (the date part defaults to today)
df$Time <- as.POSIXct(strptime(x = df$Time, format = "%H:%M:%S"))
sol <- function(d){
  Time_difference <- numeric(nrow(d))
  ind <- which(diff(d$Flag) != 0) + 1
  # calculate differences in time where a change in Flag was detected
  Time_difference[ind] <- abs(difftime(time1 = d$Time[ind],
                                       time2 = d$Time[ind - 1],
                                       units = "secs"))
  d$Time_Difference <- Time_difference
  return(d)
}
Now use the plyr package and its ddply function, which follows the split-apply-combine principle: it takes a data frame (df), splits it by a variable ("User" in this case), applies a function (sol here) to each subset, and then recombines the pieces into a single data frame.
library(plyr)
ddply(.data = df, .variables = "User", .fun = sol)
# User Time Flag Time_Difference
#1 A 11:39:30 1 0
#2 A 11:37:53 1 0
#3 A 20:44:19 1 0
#4 A 22:58:42 2 8063
#5 A 23:01:54 1 192
#6 B 23:03:00 1 0
#7 B 23:03:33 1 0
#8 B 23:03:53 1 0
#9 B 15:00:42 3 28991
#10 B 19:35:31 2 16489
#11 B 19:35:34 1 3
#12 C 10:19:06 1 0
#13 C 10:59:50 1 0
#14 C 10:59:50 1 0
#15 C 12:16:36 1 0
#16 C 12:16:36 1 0
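If you prefer dplyr over plyr, a roughly equivalent sketch (assuming dplyr >= 0.8.1, where group_modify() was introduced) would be:
library(dplyr)
df %>%
  group_by(User) %>%
  group_modify(~ sol(.x)) %>%
  ungroup()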
I have a data frame that looks a bit like this:
wt <- data.frame(region = c(rep("A", 5), rep("B", 5)), time = c(1:5, 1:5),
                 start = c(rep(2, 5), rep(4, 5)), value = rep(1, 10))
The values in the value column could be any numbers (I am working in a very large data set), but each region will be over an equal-length time series and have a single starting point.
I want to perform a cumulative sum within each region that begins accumulating at the starting point, continues forward in the time series, and wraps to the rows before the starting point in the time series.
The full data table, WITH the intended result, would look like this:
region time start value result
A 1 2 1 5
A 2 2 1 1
A 3 2 1 2
A 4 2 1 3
A 5 2 1 4
B 1 4 1 3
B 2 4 1 4
B 3 4 1 5
B 4 4 1 1
B 5 4 1 2
A simple transformation of the time column followed by cumsum does not work, since the function cares about row order and not any particular factor.
With that in mind, I am operating on a huge data table, and runtime is absolutely a concern, so any solution must avoid re-ordering rows.
Ideas of how to do this? Thanks in advance.
EDIT: Consider time to be a cycle, such as hours in a day; if the start time is 2, observations start at one instance of time 2 and end at the next time 1.
We can do this in an efficient way with data.table
library(data.table)
setDT(wt)[time >= start, result := seq_len(.N), by = region]
wt[, Max := max(result, na.rm = TRUE), by = region]
wt[is.na(result), result := Max + seq_len(.N), by = region][, Max := NULL][]
# region time start value result
#1: A 1 2 1 5
#2: A 2 2 1 1
#3: A 3 2 1 2
#4: A 4 2 1 3
#5: A 5 2 1 4
#6: B 1 4 1 3
#7: B 2 4 1 4
#8: B 3 4 1 5
#9: B 4 4 1 1
#10: B 5 4 1 2
akrun's solution works for the example I gave (hence I accepted it as the answer), but here's a version that works for any values in the value column:
library(data.table)
setDT(wt)[time >= start, result := cumsum(value), by = region]
wt[, Max := max(result, na.rm = TRUE), by = region]
wt[is.na(result), result := Max + cumsum(value), by = region][, Max := NULL][]
Just adding the... unfortunately named cumsum function in place of a calculated sequence.
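As an alternative, here is a single-pass sketch that avoids the NA intermediate entirely (assuming data.table >= 1.12.4 for fifelse(), and writing to a new illustrative column result2): within each region, take the cumulative sum over the tail of the cycle, and for the head rows add the tail's total on top of their own cumulative sum.
wt[, result2 := {
  tot <- sum(value[time >= start])             # total over the tail of the cycle
  fifelse(time >= start,
          cumsum(value * (time >= start)),     # tail rows: plain cumulative sum
          tot + cumsum(value * (time < start)))  # head rows: wrap past the tail
}, by = region]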
This question describes the setting for my question pretty well.
Instead of a second value however, I have a factor called algorithm. My data frame looks like the following (note the possibility of multiplicity of values even within their group):
algorithm <- c("global", "distributed", "distributed", "none", "global", "global", "distributed", "none", "none")
v <- c(5, 2, 6, 7, 3, 1, 10, 2, 2)
df <- data.frame(algorithm, v)
df
algorithm v
1 global 5
2 distributed 2
3 distributed 6
4 none 7
5 global 3
6 global 1
7 distributed 10
8 none 2
9 none 2
I would like to sort the data frame by v, but get the ordering position of every entry within its group (algorithm). This position should then be added to the original data frame (so I don't need to rearrange it), because I would like to plot the calculated position as x and the value as y in ggplot, grouped by algorithm (i.e. every algorithm is one set of points).
So the result should look like this:
algorithm v groupIndex
1 global 5 3
2 distributed 2 1
3 distributed 6 2
4 none 7 3
5 global 3 2
6 global 1 1
7 distributed 10 3
8 none 2 1
9 none 2 2
So far I know I can order the data by algorithm first and then by value, or the other way round. I guess in a second step I would have to calculate the index within each group? Is there an easy way to do that?
df[order(df$algorithm, df$v), ]
algorithm v
2 distributed 2
3 distributed 6
7 distributed 10
6 global 1
5 global 3
1 global 5
8 none 2
9 none 2
4 none 7
Edit: it is not guaranteed that there is the same number of entries in each group!
A double application of order in each group should cover it:
ave(df$v, df$algorithm, FUN=function(x) order(order(x)) )
#[1] 3 1 2 3 2 1 3 1 2
Which is also equivalent to:
ave(df$v, df$algorithm, FUN=function(x) rank(x,ties.method="first") )
#[1] 3 1 2 3 2 1 3 1 2
This in turn means you can take advantage of frank() from data.table if you are concerned about speed:
library(data.table)
setDT(df)[, grpidx := frank(v, ties.method = "first"), by = algorithm]
df
# algorithm v grpidx
#1: global 5 3
#2: distributed 2 1
#3: distributed 6 2
#4: none 7 3
#5: global 3 2
#6: global 1 1
#7: distributed 10 3
#8: none 2 1
#9: none 2 2
One way would be the following. You can rank the v values within each group using with_order() together with row_number(); that way you can skip the separate step of arranging the data per group, as you tried with order().
library(dplyr)
group_by(df, algorithm) %>%
  mutate(groupInd = with_order(order_by = v, fun = row_number, x = v))
# algorithm v groupInd
# <fctr> <int> <int>
#1 global 5 3
#2 distributed 2 1
#3 distributed 6 2
#4 none 7 3
#5 global 3 2
#6 global 1 1
#7 distributed 10 3
#8 none 2 1
#9 none 2 2
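As a side note, dplyr's row_number() also accepts a vector (it is equivalent to rank() with ties.method = "first"), which shortens this to:
group_by(df, algorithm) %>%
  mutate(groupInd = row_number(v))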
I have a two-level dataset (let's say classes nested within schools), coded like this:
School Class
A 1
A 1
A 2
A 2
B 1
B 1
B 2
B 2
But to run an analysis I need the data to have a unique Class ID, regardless of school membership.
School Class NewClass
A 1 1
A 1 1
A 2 2
A 2 2
B 1 3
B 1 3
B 2 4
B 2 4
I tried using transform and ddply, but I'm not sure how to keep NewClass incrementing to a larger number for each combination of School and Class. I can think of a few inelegant ways to do this, but I'm sure there are much easier solutions I just can't think of right now. Any help would be appreciated!
Use interaction to create a factor, and then coerce it to integer:
transform(dat,nn = as.integer(interaction(Class,School)))
School Class nn
1 A 1 1
2 A 1 1
3 A 2 2
4 A 2 2
5 B 1 3
6 B 1 3
7 B 2 4
8 B 2 4
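One caveat with interaction(): by default it keeps levels for every possible Class/School combination, so if some combinations never occur, the integer ids will have gaps. Dropping unused levels keeps them consecutive:
transform(dat, nn = as.integer(interaction(Class, School, drop = TRUE)))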
Using data.table:
library(data.table)
dt = as.data.table(your_df)
dt[, NewClass := .GRP, by = list(School, Class)]
dt
# School Class NewClass
#1: A 1 1
#2: A 1 1
#3: A 2 2
#4: A 2 2
#5: B 1 3
#6: B 1 3
#7: B 2 4
#8: B 2 4
.GRP is simply a group counter. Also note that you don't really need to create NewClass at all: you can keep using the combination list(School, Class) in whatever by operation you need to do.
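For example, a per-class summary can group by the combination directly, with no NewClass column needed:
dt[, .N, by = list(School, Class)]
#   School Class N
#1:      A     1 2
#2:      A     2 2
#3:      B     1 2
#4:      B     2 2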
Note that from data.table versions >= 1.9.0, a function setDT is exported that converts a data.frame to data.table by reference (no copy is made), in case you'd want to stick to data.tables.
require(data.table) ## >= 1.9.0
setDT(your_df) ## your_df is now a data.table, changed by reference.
The data comes from another question I was playing around with:
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")
# user country event
#1: 3 1 1
#2: 3 1 2
#3: 3 1 3
#4: 3 1 4
#5: 3 2 5
#6: 4 2 6
#7: 4 2 7
#8: 4 2 8
#9: 4 2 9
#10: 4 2 10
And here's the surprising behavior:
dt[user == 3, as.data.frame(table(country))]
# country Freq
#1 1 4
#2 2 1
dt[user == 4, as.data.frame(table(country))]
# country Freq
#1 2 5
dt[, as.data.frame(table(country)), by = user]
# user country Freq
#1: 3 1 4
#2: 3 2 1
#3: 4 1 5
# ^^^ - why is this 1 instead of 2?!
Thanks mnel and Victor K. The natural follow-up is: shouldn't it be 2? That is, is this a bug? I expected
dt[, blah, by = user]
to return identical result to
rbind(dt[user == 3, blah], dt[user == 4, blah])
Is that expectation incorrect?
The idiomatic data.table approach is to use .N
dt[ , .N, by = list(user, country)]
This will be far quicker and it will also retain country as the same class as in the original.
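For the sample above, that gives:
dt[, .N, by = list(user, country)]
#   user country N
#1:    3       1 4
#2:    3       2 1
#3:    4       2 5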
As mnel noted in comments, as.data.frame(table(...)) produces a data frame where the first variable is a factor. For user == 4, there is only one level in the factor, which is stored internally as 1.
What you want is factor levels, but what you get is how factors are stored internally (as integers, starting from 1). The following provides the expected result:
> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
user country Freq
1: 3 1 4
2: 3 2 1
3: 4 2 5
Update. Regarding your second question: no, I think the data.table behaviour is correct. The same thing happens in plain R when you concatenate two factors with different levels:
> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
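As an aside, this integer-code result from c(a, b) is the historical behaviour; newer versions of R (4.1.0 and later, as far as I know) give c() a factor method that combines the levels instead. A version-independent way to combine the values rather than the codes is to go through character:
factor(c(as.character(a), as.character(b)))
# [1] 3 4 5 6 7 8
# Levels: 3 4 5 6 7 8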