I am working on a "wide" dataset, and now I would like to use a specific package (-msSurv-, for non-parametric multistate models) which requires data in interval form.
My current dataset is characterized by one row for each individual:
dat <- read.table(text = "
id cohort t0 s1 t1 s2 t2 s3 t3
1 2 0 1 50 2 70 4 100
2 1 0 2 15 3 100 0 0
", header=TRUE)
where cohort is a time-fixed covariate, and s1-s3 correspond to the values that a time-varying covariate s = 1,2,3,4 takes over time (they are the distinct states visited by the individual over time). Calendar time is defined by t1-t3, and ranges from 0 to 100 for each individual.
So, for instance, individual 1 stays in state = 1 up to calendar time = 50, then he stays in state = 2 up to time = 70, and finally he stays in state = 4 up to time 100.
What I would like to obtain is a dataset in "interval" form, that is:
id cohort t.start t.stop start.s end.s
1 2 0 50 1 2
1 2 50 70 2 4
1 2 70 100 4 4
2 1 0 15 2 3
2 1 15 100 3 3
I hope the example is sufficiently clear, otherwise please let me know and I will try to further clarify.
How would you automate this reshaping? Consider that I have a relatively large number of (simulated) individuals, around 1 million.
Thank you very much for any help.
I think I understand. Does this work?
require(data.table)
dt <- data.table(dat, key=c("id", "cohort"))
dt.out <- dt[, list(t.start=c(t0,t1,t2), t.stop=c(t1,t2,t3),
start.s=c(s1,s2,s3), end.s=c(s2,s3,s3)),
by = c("id", "cohort")]
# id cohort t.start t.stop start.s end.s
# 1: 1 2 0 50 1 2
# 2: 1 2 50 70 2 4
# 3: 1 2 70 100 4 4
# 4: 2 1 0 15 2 3
# 5: 2 1 15 100 3 0
# 6: 2 1 100 0 0 0
If the output you show is indeed right and is what you require, then you can obtain it with two more lines (probably not the best way, but it should nevertheless be fast):
# remove rows where start.s and end.s are both 0
dt.out <- dt.out[start.s > 0 | end.s > 0]
# replace end.s values with the corresponding start.s values where end.s == 0
# (done explicitly rather than with max(), so it does not rely on end.s >= start.s)
dt.out[end.s == 0, end.s := start.s]
> dt.out
# id cohort t.start t.stop start.s end.s
# 1: 1 2 0 50 1 2
# 2: 1 2 50 70 2 4
# 3: 1 2 70 100 4 4
# 4: 2 1 0 15 2 3
# 5: 2 1 15 100 3 3
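For comparison, here is a minimal base R sketch of the same wide-to-interval reshaping. It is only an illustration, not the method from the answer above: it assumes that every row has exactly three (state, time) pairs named s1..s3 and t1..t3, and that 0 in a state column means "no further state". The names `k`, `pieces` and `out` are made up for this example.

```r
dat <- read.table(text = "
id cohort t0 s1 t1 s2 t2 s3 t3
1 2 0 1 50 2 70 4 100
2 1 0 2 15 3 100 0 0
", header = TRUE)

k <- 3  # assumed number of (state, time) pairs per row

# one block of intervals per position j: (t_{j-1}, t_j], state s_j -> s_{j+1}
pieces <- lapply(seq_len(k), function(j) {
  nxt <- if (j < k) dat[[paste0("s", j + 1)]] else rep(0L, nrow(dat))
  data.frame(id = dat$id, cohort = dat$cohort,
             t.start = dat[[paste0("t", j - 1)]],
             t.stop  = dat[[paste0("t", j)]],
             start.s = dat[[paste0("s", j)]],
             end.s   = nxt)
})
out <- do.call(rbind, pieces)

out <- out[out$start.s > 0, ]          # drop padding intervals
last <- out$end.s == 0                 # no next state: stay in the same state
out$end.s[last] <- out$start.s[last]
out <- out[order(out$id, out$t.start), ]
rownames(out) <- NULL
out
```

Because every step works on whole columns, the number of R-level iterations depends on `k` and not on the number of individuals.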
I would like to divide the value of a column according to a condition.
Something like the following:
Data$ColumnA [Data$ColumnA>50 && Data$ColumnB> 0] <- Data$ColumnA / 25
The problem is that Data$ColumnA / 25 loses the "index", and the division is applied to the first value in the list.
Thank you
I prefer using a data.table instead of a data.frame not only for performance reasons but also because the syntax is more compact:
library(data.table)
Data <- data.frame(ColumnA = seq(0, 175, by = 25),
ColumnB = c(0, 1))
Data
# ColumnA ColumnB
# 1 0 0
# 2 25 1
# 3 50 0
# 4 75 1
# 5 100 0
# 6 125 1
# 7 150 0
# 8 175 1
setDT(Data) # "convert" data.frame into a data.table
Data[ColumnA > 50 & ColumnB > 0, ColumnA := ColumnA / 25]
Data
# ColumnA ColumnB
# 1: 0 0
# 2: 25 1
# 3: 50 0
# 4: 3 1
# 5: 100 0
# 6: 5 1
# 7: 150 0
# 8: 7 1
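The same update can also be sketched in base R. The two points that matter are the element-wise `&` (instead of `&&`, which is meant for single values) and applying the same logical index on both sides of the assignment. The name `idx` is just a helper introduced for this example.

```r
Data <- data.frame(ColumnA = seq(0, 175, by = 25),
                   ColumnB = c(0, 1))

# element-wise condition: one TRUE/FALSE per row
idx <- Data$ColumnA > 50 & Data$ColumnB > 0

# index both sides so each selected row is divided by 25 in place
Data$ColumnA[idx] <- Data$ColumnA[idx] / 25
Data$ColumnA
# 0 25 50 3 100 5 150 7
```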
How do I perform a conditional aggregation by ID without including the max date and criteria of specific date range such as date minus certain days
QUESTIONS # A.1, A.2 and B:
INPUT Data = o2i
ID date event_p event_b
1 8/7/2016 1 0
1 8/1/2016 1 0
1 8/1/2016 1 1
2 7/28/2016 1 0
2 8/7/2016 1 1
2 7/29/2016 1 1
3 7/10/2016 1 0
3 7/7/2016 1 1
3 7/14/2016 1 1
4 8/24/2016 1 1
4 8/26/2016 1 1
Solution A.1) I would like to restrict the sum so that it only counts those events that happened in the 7 days before the date given in the Date column (per user ID).
Note: NOT including the Date itself (but going back up to 7 days before the Date).
Note: if no record exists for (Date-7 Days), the logic is still the same.
A.1 - OUTPUT Expected:
ID date event_p
1 8/7/2016 2
1 8/1/2016 0
2 7/28/2016 0
2 8/7/2016 0
2 7/29/2016 1
3 7/10/2016 1
3 7/7/2016 0
3 7/14/2016 2
4 8/24/2016 0
4 8/26/2016 1
Note: Here 8/1/2016 had two rows in input file (on same date) but in Output it is displayed as one. This is preferred, but if two rows are displayed, that is also fine.
Solution A.2) Instead of summing event_p (for the past 7 days) and event_b separately, is there a way to write the code so that both event_p and event_b are aggregated by the same logic?
A.2 - OUTPUT Expected:
ID date event_p event_b
1 8/7/2016 2 1
1 8/1/2016 0 0
2 7/28/2016 0 0
2 8/7/2016 0 0
2 7/29/2016 1 0
3 7/10/2016 1 1
3 7/7/2016 0 0
3 7/14/2016 2 1
4 8/24/2016 0 1
4 8/26/2016 1 1
Note: Here 8/1/2016 had two rows in input file (on same date) but in Output it is displayed as one. This is preferred, but if two rows are displayed, that is also fine.
Solution B) I would like to sum the count of those events that happened before the date given in the Date column (per user ID).
Note: NOT including the Date itself (but going back up to "all" days before the Date)
Note: if record of (Date-7 Days) does not exist then still the logic is same.
WHAT I TRIED:
I have researched and looked at this site, and tried to write the code for almost a week; the closest I could get is the code below.
My TRY A.1:
# convert factor to POSIXlt
o2i$date <- as.POSIXlt(o2i$date, format="%m/%d/%Y")
class(o2i$date)
o2i$date
o2i
# convert factor to date
o2i$date <- as.Date(o2i$date)
class(o2i$date)
# Aggregation Option 1
cum7_event_p <- aggregate(event_p~ID+date, subset(o2i, date < max(o2i$date) & date >= (o2i$date)-7),sum)
cum7_event_p
# Aggregation Option 2
cum7_event_p <- aggregate(event_p~ID+date, subset(o2i, date < max((o2i$date)-1) & date >= (o2i$date)-7),sum)
cum7_event_p
WHAT I GOT FOR A.1)
ID date event_p
3 7/7/2016 1
3 7/10/2016 1
3 7/14/2016 1
2 7/28/2016 1
2 7/29/2016 1
1 8/1/2016 2
1 8/7/2016 1
2 8/7/2016 1
4 8/24/2016 1
Note: It is counting the events on the particular date as well.. For example, on 8/1/2016 it is showing sum of 2. But as per logic it should show count as "0" , because it is count of 7 days (before that date - not including the date) ...and on 8/7/2016 it should show count of 2.
My TRY A.2:
## All Event Aggregation ##
cum7 <- aggregate(o2i[,3:4], o2i[, c(1,2)], data=subset(o2i, date < max(o2i$date) & date >= (o2i$date)-7), sum)
# Error: Error in FUN(X[[i]], ...) : invalid 'type' (list) of argument
cum7 <- aggregate(o2i[,3:4], o2i[, c(1,2)], sum) # Does not include the Logic of "Calling the Date (every date - per ID) and calling it a Max Date, while counting)
cum7
WHAT I GOT FOR A.2
ID date event_p event_b
3 7/7/2016 1 1
3 7/10/2016 1 0
3 7/14/2016 1 1
2 7/28/2016 1 0
2 7/29/2016 1 1
1 8/1/2016 2 1
1 8/7/2016 1 0
2 8/7/2016 1 1
4 8/24/2016 1 1
4 8/26/2016 1 1
Note: I am not sure how to write the best code to incorporate the logic (NOT including the particular date, and summing either over the 7 days before that date and/or over all dates before that date).
I hope I was clear in explaining my problem and the expected output. If someone writes a function to solve this, I would be very thankful if you could kindly add a few lines of explanation on how it works.
I can't find an elegant way of doing this. I'd just create a window function that takes date and events columns together, and outputs desired outcome. You can use dplyr to apply the function for each ID by doing group by.
library(lubridate)
library(dplyr)
myfun <- function(dates,events){
ct <- rep(0,length(dates))
for (i in seq_along(dates)){
ct[i] <- sum(events[between(dates,dates[i]-7,dates[i]-1)])
}
return(ct)
}
dt <- read.table('testdata', header = TRUE)
output <- dt %>%
mutate(date = as.Date(parse_date_time(date,c('mdy')))) %>%
group_by(ID) %>%
mutate(summary_event_p = myfun(date,event_p), summary_event_b = myfun(date,event_b)) %>%
ungroup() %>%
distinct(ID,date,summary_event_p,summary_event_b)
# # A tibble: 10 × 4
# ID date summary_event_p summary_event_b
# <int> <date> <dbl> <dbl>
# 1 1 2016-08-07 2 1
# 2 1 2016-08-01 0 0
# 3 2 2016-07-28 0 0
# 4 2 2016-08-07 0 0
# 5 2 2016-07-29 1 0
# 6 3 2016-07-10 1 1
# 7 3 2016-07-07 0 0
# 8 3 2016-07-14 2 1
# 9 4 2016-08-24 0 0
# 10 4 2016-08-26 1 1
testdata file is just a text file of copied contents of your data. I used a lubridate function to just clean up the date format in your raw data.
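A base R sketch of the same window logic, for readers who want to avoid extra packages. It is an illustration under the assumptions stated in the question: the window is the 7 days strictly before each date, evaluated per ID. The helper `window_sum` and the result name `res` are made up for this example.

```r
o2i <- read.table(text = "
ID date event_p event_b
1 8/7/2016 1 0
1 8/1/2016 1 0
1 8/1/2016 1 1
2 7/28/2016 1 0
2 8/7/2016 1 1
2 7/29/2016 1 1
3 7/10/2016 1 0
3 7/7/2016 1 1
3 7/14/2016 1 1
4 8/24/2016 1 1
4 8/26/2016 1 1
", header = TRUE, stringsAsFactors = FALSE)
o2i$date <- as.Date(o2i$date, format = "%m/%d/%Y")

# for each date, sum the events falling in [date - days, date)
window_sum <- function(dates, events, days = 7) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] - days & dates < dates[i]
    sum(events[in_window])
  }, numeric(1))
}

# apply the helper within each ID, then drop duplicated (ID, date) rows
res <- do.call(rbind, lapply(split(o2i, o2i$ID), function(g) {
  data.frame(ID = g$ID, date = g$date,
             event_p = window_sum(g$date, g$event_p),
             event_b = window_sum(g$date, g$event_b))
}))
res <- unique(res)
rownames(res) <- NULL
res
```

For question B (all earlier dates rather than the last 7 days), the lower bound `dates >= dates[i] - days` can simply be dropped from `in_window`.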
I am trying to split one column in a data frame into multiple columns which hold the values from the original column as new column names. Then, if there was an occurrence for that respective column in the original, give it a 1 in the new column, or 0 if there was no match. I realize this is not the best way to explain it, so here is an example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and would like to expand it to wide format, something such as, with 1's and 0's (or T and F):
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr and the separate function, and reshape2 and the cast function, but I seem to be getting hung up on producing logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
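A variant of the step-by-step idea that compares whole tokens instead of using pattern matching, sketched in base R. This matters when one location name is a substring of another (for example "A" and "AB"), which is an assumption beyond the example data. The names `parts`, `lev` and `res` are made up for this example.

```r
df <- data.frame(subject = 1:4,
                 Location = c("A", "A/B", "B/C/D", "A/B/C/D"),
                 stringsAsFactors = FALSE)

# split each row into its tokens
parts <- strsplit(df$Location, "/", fixed = TRUE)
# all distinct tokens, in sorted order
lev <- sort(unique(unlist(parts)))

# one 0/1 indicator per token and row (exact match via %in%)
m <- t(vapply(parts, function(p) as.integer(lev %in% p),
              integer(length(lev))))
colnames(m) <- lev

res <- cbind(df["subject"], m)
res
```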
I am currently working on a Multistate Analysis dataset in "long" form (one row for each individual's observation; each individual is repeatedly measured up to 5 times).
The idea is that each individual can recurrently transition across the levels of the time-varying state variable s = 1, 2, 3, 4. All the other variables that I have (here cohort) are fixed within any given id.
After some analyses, I need to reshape the dataset in "wide" form, according to the specific sequence of visited states. Here is an example of the initial long data:
dat <- read.table(text = "
id cohort s
1 1 2
1 1 2
1 1 1
1 1 4
2 3 1
2 3 1
2 3 3
3 2 1
3 2 2
3 2 3
3 2 3
3 2 4",
header=TRUE)
The final "wide" dataset should take into account the specific individual sequence of visited states, recorded into the newly created variables s1, s2, s3, s4, s5, where s1 is the first state visited by the individual and so on.
According to the above example, the wide dataset looks like:
id cohort s1 s2 s3 s4 s5
1 1 2 2 1 4 0
2 3 1 1 3 0 0
3 2 1 2 3 3 4
I tried to use reshape(), and also to focus on transposing s, but without the intended result. My knowledge of R functions is quite limited. Can you give any suggestions? Thanks.
EDIT: obtaining a different kind of wide dataset
Thank you all for your help, I have a related question if I can. Especially when each individual is observed for a long time and there are few transitions across states, it is very useful to reshape the initial sample dat in this alternative way:
id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4 dur5
1 1 2 1 4 0 0 2 1 1 0 0
2 3 1 3 0 0 0 2 1 0 0 0
3 2 1 2 3 4 0 1 1 2 1 0
In practice now s1-s5 are the distinct visited states, and dur1-dur5 the time spent in each respective distinct visited state.
Can you please give a hand for reaching this data structure? I believe it is necessary to create all the dur- and s- variables in an intermediate sample before using reshape(). Otherwise maybe it is possible to directly adopt -reshape2-?
dat <- read.table(text = "
id cohort s
1 1 2
1 1 2
1 1 1
1 1 4
2 3 1
2 3 1
2 3 3
3 2 1
3 2 2
3 2 3
3 2 3
3 2 4",
header=TRUE)
df <- data.frame(
dat,
period = sequence(rle(dat$id)$lengths)
)
wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
timevar = "period", direction = "wide")
wide[is.na(wide)] = 0
wide
Gives:
id cohort s.1 s.2 s.3 s.4 s.5
1 1 1 2 2 1 4 0
5 2 3 1 1 3 0 0
8 3 2 1 2 3 3 4
then using the following line gives your names:
names(wide) <- c('id','cohort', paste('s', 1:5, sep=''))
# id cohort s1 s2 s3 s4 s5
# 1 1 1 2 2 1 4 0
# 5 2 3 1 1 3 0 0
# 8 3 2 1 2 3 3 4
If you use sep='' in the wide statement you do not have to rename the variables:
wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
timevar = "period", direction = "wide", sep='')
I suspect there are ways to avoid creating the period variable and avoid replacing NA directly in the wide statement, but I have not figured those out yet.
ok...
library(plyr)
library(reshape2)
dat2 <- ddply(dat,.(id,cohort), function(x)
data.frame(s=x$s,name=paste0("s",seq_along(x$s))))
dat2 <- ddply(dat2,.(id,cohort), function(x)
dcast(x, id + cohort ~ name, value.var= "s" ,fill= 0)
)
dat2[is.na(dat2)] <- 0
dat2
# id cohort s1 s2 s3 s4 s5
# 1 1 1 2 2 1 4 0
# 2 2 3 1 1 3 0 0
# 3 3 2 1 2 3 3 4
Does this seem right? I admit the first ddply is hardly elegant.
Try this:
library(reshape2)
dat$seq <- ave(dat$id, dat$id, FUN = function(x) paste0("s", seq_along(x)))
dat.s <- dcast(dat, id + cohort ~ seq, value.var = "s", fill = 0)
which gives this:
> dat.s
id cohort s1 s2 s3 s4 s5
1 1 1 2 2 1 4 0
2 2 3 1 1 3 0 0
3 3 2 1 2 3 3 4
If you did not mind using just 1, 2, ..., 5 as column names then you could shorten the ave line to just:
dat$seq <- ave(dat$id, dat$id, FUN = seq_along)
Regarding the second question that was added later try this:
library(plyr)
dur.fn <- function(x) {
r <- rle(x$s)$lengths
data.frame(id = x$id[1], dur.value = r, dur.seq = paste0("dur", seq_along(r)))
}
dat.dur.long <- ddply(dat, .(id), dur.fn)
dat.dur <- dcast(dat.dur.long, id ~ dur.seq, c, value.var = "dur.value", fill = 0)
cbind(dat.s, dat.dur[-1])
which gives:
id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4
1 1 1 2 2 1 4 0 2 1 1 0
2 2 3 1 1 3 0 0 2 1 0 0
3 3 2 1 2 3 3 4 1 1 2 1
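If the goal is the layout from the EDIT, with s1-s5 holding only the distinct consecutive states and dur1-dur5 the time spent in each, one possible base R sketch uses both components of rle(). The padding width `max_k = 5` and the helper `pad` are assumptions chosen to match the example; they are not part of the answers above.

```r
dat <- read.table(text = "
id cohort s
1 1 2
1 1 2
1 1 1
1 1 4
2 3 1
2 3 1
2 3 3
3 2 1
3 2 2
3 2 3
3 2 3
3 2 4",
header = TRUE)

max_k <- 5  # assumed maximum number of distinct visited states
pad <- function(x, n) c(x, rep(0L, n - length(x)))  # fill with 0 up to length n

rows <- lapply(split(dat, dat$id), function(g) {
  r <- rle(g$s)  # r$values: visited states, r$lengths: time in each
  c(id = g$id[1], cohort = g$cohort[1],
    setNames(pad(r$values,  max_k), paste0("s",   1:max_k)),
    setNames(pad(r$lengths, max_k), paste0("dur", 1:max_k)))
})
wide <- as.data.frame(do.call(rbind, rows))
wide
#   id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4 dur5
# 1  1      1  2  1  4  0  0    2    1    1    0    0
# 2  2      3  1  3  0  0  0    2    1    0    0    0
# 3  3      2  1  2  3  4  0    1    1    2    1    0
```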
I am trying to reshape the following dataset with reshape(), without much success.
The starting dataset is in "wide" form, with each id described by one row. The dataset is intended to be used for carrying out Multistate analyses (a generalization of Survival Analysis).
Each person is recorded for a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity let us fix to two the maximum number of distinct states that can be visited). The first visited state is s1 = 1, 2, 3, 4. The person stays within the state for dur1 time periods, and the same applies for the second visited state s2:
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3
The dataset in long format which I would like to obtain is:
id cohort s
1 1 3
1 1 3
1 1 3
1 1 3
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 0 1
2 0 1
2 0 1
2 0 4
2 0 4
2 0 4
In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted in a single variable s.
How would you do this transformation? Also, how would you come back to the original dataset in "wide" form?
Many thanks!
dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))
You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.
Here's how to proceed:
reshape() your data from wide to long:
dat2 <- reshape(data.frame(dat), direction = "long",
idvar = c("id", "cohort"),
varying = 3:ncol(dat), sep = "")
dat2
# id cohort time s dur
# 1.1.1 1 1 1 3 4
# 2.0.1 2 0 1 1 4
# 1.1.2 1 1 2 2 5
# 2.0.2 2 0 2 4 3
"Expand" the resulting data.frame using rep()
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
# id cohort s
# 1.1.1 1 1 3
# 1.1.1.1 1 1 3
# 1.1.1.2 1 1 3
# 1.1.1.3 1 1 3
# 1.1.2 1 1 2
# 1.1.2.1 1 1 2
# 1.1.2.2 1 1 2
# 1.1.2.3 1 1 2
# 1.1.2.4 1 1 2
# 2.0.1 2 0 1
# 2.0.1.1 2 0 1
# 2.0.1.2 2 0 1
# 2.0.1.3 2 0 1
# 2.0.2 2 0 4
# 2.0.2.1 2 0 4
# 2.0.2.2 2 0 4
You can get rid of the funky row names too by using rownames(dat3) <- NULL.
Update: Retaining the ability to revert to the original form
In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you feel this is something you would need to do, I suggest keeping those columns in and creating another data.frame with the subset of the columns that you need if required.
Here's how:
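Use aggregate() to get back to "dat2" (this assumes dat3 was built without the column subset, so that "time" and "dur" are still present):
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), ]
aggregate(cbind(s, dur) ~ ., dat3, unique)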
# id cohort time s dur
# 1 2 0 1 1 4
# 2 1 1 1 3 4
# 3 2 0 2 4 3
# 4 1 1 2 2 5
Wrap reshape() around that to get back to the original wide form of "dat". Here, in one step:
reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
direction = "wide", idvar = c("id", "cohort"))
# id cohort s.1 dur.1 s.2 dur.2
# 1 2 0 1 4 4 3
# 2 1 1 3 4 2 5
There are probably better ways, but this might work.
df <- read.table(text = '
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3',
header=TRUE)
hist <- matrix(0, nrow=2, ncol=9)
hist
for(i in 1:nrow(df)) {
hist[i,] <- c(rep(df[i,3], df[i,4]), rep(df[i,5], df[i,6]), rep(0, (9 - df[i,4] - df[i,6])))
}
hist
hist2 <- cbind(df[,1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste('x', 1:9, sep=''))
library(reshape2)
hist3 <- melt(hist2, id.vars=c('id', 'cohort'), variable.name='x', value.name='state')
hist4 <- hist3[order(hist3$id, hist3$cohort),]
hist4
hist4 <- hist4[ , !names(hist4) %in% c("x")]
hist4 <- hist4[hist4$state != 0, ]  # drop the zero padding, whatever the cohort
Gives:
id cohort state
1 1 1 3
3 1 1 3
5 1 1 3
7 1 1 3
9 1 1 2
11 1 1 2
13 1 1 2
15 1 1 2
17 1 1 2
2 2 0 1
4 2 0 1
6 2 0 1
8 2 0 1
10 2 0 4
12 2 0 4
14 2 0 4
Of course, if you have more than two states per id then this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose with 9 sample periods one person could be in the following sequence of states:
1 3 2 4 3 4 1 1 2
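To handle any number of visited states without hard-coding the column positions, one option is to pick up the s* and dur* columns by name and expand each row with rep(). The following is a base R sketch under two assumptions: the columns are named s1, dur1, s2, dur2, ... and appear in matching order, and two consecutive visited states are never identical (otherwise rle() would merge them on the way back). The names `s_cols`, `dur_cols`, `long` and `runs` are made up for this example.

```r
wide <- data.frame(id = c(1, 2), cohort = c(1, 0),
                   s1 = c(3, 1), dur1 = c(4, 4),
                   s2 = c(2, 4), dur2 = c(5, 3))

# locate the state and duration columns by name
s_cols   <- grep("^s[0-9]+$",   names(wide), value = TRUE)
dur_cols <- grep("^dur[0-9]+$", names(wide), value = TRUE)

# wide -> long: repeat each state as many times as its duration
long <- do.call(rbind, lapply(seq_len(nrow(wide)), function(i) {
  states <- unname(unlist(wide[i, s_cols]))
  durs   <- unname(unlist(wide[i, dur_cols]))
  data.frame(id = wide$id[i], cohort = wide$cohort[i],
             s = rep(states, times = durs))
}))

# long -> run-length form: states and durations per id
runs <- lapply(split(long, long$id), function(g) rle(g$s))
runs[["1"]]$values   # 3 2
runs[["1"]]$lengths  # 4 5
```

A state with duration 0 is simply dropped by rep(), so rows padded with unused s/dur pairs need no special handling in the wide-to-long direction.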