Transforming longitudinal data for time-to-event analysis in R

I am trying to reformat longitudinal data for a time-to-event analysis. In the example data below, I simply want to find the earliest week in which the result was 0 for each patient.
The specific issue I am having is how to handle patients who never convert to 0 and have only 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but I realize that anyone who never converts to 0 is excluded (in this example, patient J). Instead, J should be present with a 1 in the second column and an 8 in the third column.
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
In keeping with the posting guidelines here, I did attempt to solve this and can get what I need, though very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the desired dataset, but the code is clumsy and will not carry over to other datasets (e.g. datasets with different timeframes, since I have to hard-code maxs[3]==8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?

We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 if one is present; otherwise it returns the index of the maximum week.
We then apply it to every group:
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
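To see what select_row returns on its own, here is a minimal sketch that calls it on plain vectors and then applies it per patient with base R instead of dplyr. The names have_small and picked are hypothetical, and the data are just patients A and J from the example above.

```r
# Same helper as in the answer above
select_row <- function(result, week) {
  if (any(result == 0)) which.max(result == 0) else which.max(week)
}

# which.max() on a logical vector gives the position of the first TRUE
select_row(c(1, 0, 2, 0), 0:3)   # first 0 is at position 2
select_row(c(1, 1, 1), 0:2)      # no 0, so position of the largest week: 3

# Hypothetical two-patient subset of the example data (A converts, J never does)
have_small <- data.frame(
  patient = rep(c("A", "J"), each = 9),
  week    = rep(0:8, times = 2),
  result  = c(1, 0, 2, rep(0, 6), rep(1, 9))
)

# Base R split-apply-combine: keep one row per patient
picked <- do.call(rbind, lapply(split(have_small, have_small$patient), function(d) {
  d[select_row(d$result, d$week), ]
}))
picked
```

Because the fallback uses which.max(week), nothing about the follow-up length is hard-coded, which is what the question was asking for.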

Related

Continual summation of a column in R until condition is met

I am doing my best to learn R, and this is my first post on this forum.
I currently have a data frame with a populated vector "x" and an unpopulated vector "counter" as follows:
x <- c(NA,1,0,0,0,0,1,1,1,1,0,1)
df <- data.frame("x" = x, "counter" = 0)
x counter
1 NA 0
2 1 0
3 0 0
4 0 0
5 0 0
6 0 0
7 1 0
8 1 0
9 1 0
10 1 0
11 0 0
12 1 0
I am having a surprisingly difficult time trying to write code that will simply populate counter so that counter sums the cumulative, sequential 1s in x, but reverts back to zero when x is zero. Accordingly, I would like counter to calculate as follows per the above example:
x counter
1 NA NA
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 1
8 1 2
9 1 3
10 1 4
11 0 0
12 1 1
I have tried using lag() and ifelse(), both with and without for loops, but seem to be getting further and further away from a workable solution (while lag got me close, the figures were not calculating as expected....my ifelse and for loops eventually ended up with length 1 vectors of NA_real_, NA or 1). I have also considered cumsum - but not sure how to frame the range to just the 1s - and have searched and reviewed similar posts, for example How to add value to previous row if condition is met; however, I still cannot figure out what I would expect to be a very simple task.
Admittedly, I am at a low point in my early R learning curve and greatly appreciate any help and constructive feedback anyone from the community can provide. Thank you.
You can use:
library(dplyr)
df %>%
group_by(x1 = cumsum(replace(x, is.na(x), 0) == 0)) %>%
mutate(counter = (row_number() - 1) * x) %>%
ungroup %>%
select(-x1)
# x counter
# <dbl> <dbl>
# 1 NA NA
# 2 1 1
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
# 7 1 1
# 8 1 2
# 9 1 3
#10 1 4
#11 0 0
#12 1 1
Explaining the steps -
Create a grouping column (x1): replace NA in x with 0, then use cumsum to start a new group each time x is 0.
Within each group, subtract 1 from the row number and multiply the result by x. The multiplication keeps counter at 0 where x is 0 and at NA where x is NA.
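The two steps can also be traced in base R, which makes the intermediate grouping vector visible. This is only a sketch of the same idea using ave() in place of group_by/mutate; grp and pos are hypothetical names.

```r
x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

# Step 1: a new group starts at every 0 (NA is treated as 0 for grouping only)
grp <- cumsum(replace(x, is.na(x), 0) == 0)
grp

# Step 2: position within each group, minus 1, multiplied by x
# (the multiplication turns the NA row into NA and keeps the 0 rows at 0)
pos <- ave(seq_along(x), grp, FUN = seq_along)
counter <- (pos - 1) * x
counter
```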
Welcome @cpanagakos.
In dplyr::lag it's not possible to use a column that doesn't exist yet.
(It can't refer to itself.)
https://www.reddit.com/r/rstats/comments/a34n6b/dplyr_use_previous_row_from_a_column_thats_being/
For example:
library(tidyverse)
df <- tibble("x" = c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1))
# error: lag cannot refer to a column that still doesn't exist
df %>%
mutate(counter = case_when(is.na(x) ~ coalesce(lag(counter), 0),
x == 0 ~ 0,
x == 1 ~ lag(counter) + 1))
#> Error: Problem with `mutate()` input `counter`.
#> x object 'counter' not found
#> i Input `counter` is `case_when(...)`.
So, if you have a criterion that "resets" the counter, you need to write an expression that changes the group whenever a reset is needed, and then refer to row_number, which restarts at 1 inside each group (as @Ronald Shah and others suggest):
Create sequential counter that restarts on a condition within panel data groups
df %>%
group_by(x1 = cumsum(!coalesce(x, 0))) %>%
mutate(counter = row_number() - 1) %>%
ungroup()
#> # A tibble: 12 x 3
#> x x1 counter
#> <dbl> <int> <dbl>
#> 1 NA 1 NA
#> 2 1 1 1
#> 3 0 2 0
#> 4 0 3 0
#> 5 0 4 0
#> 6 0 5 0
#> 7 1 5 1
#> 8 1 5 2
#> 9 1 5 3
#> 10 1 5 4
#> 11 0 6 0
#> 12 1 6 1
This would be one of the few cases where using a for loop in R could be justified, because the alternatives are conceptually harder to understand.
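For comparison, a plain for loop for the same counter might look like the sketch below. It assumes that an NA in x yields NA in counter and resets the running count; run is a hypothetical helper variable.

```r
x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

counter <- numeric(length(x))
run <- 0                      # length of the current streak of 1s
for (i in seq_along(x)) {
  if (is.na(x[i])) {
    run <- 0                  # assumption: a missing value resets the streak
    counter[i] <- NA
  } else if (x[i] == 0) {
    run <- 0                  # a 0 resets the streak
    counter[i] <- 0
  } else {
    run <- run + 1            # a 1 extends the streak
    counter[i] <- run
  }
}
counter
```

The loop states the reset rule directly, at the cost of being slower than the grouped version on very long vectors.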

How to convert the result of xtabs() into dataframe in R? [duplicate]

This question already has answers here:
How to convert a table to a data frame
(5 answers)
Closed 4 years ago.
I have data like dataframe df_a, and want to have it converted to the format as in dataframe df_b.
xtabs() gives similar result, but I did not find a way to access elements as in the example code below. Accessing through xa[1,1] gives no advantage since there is a weak correlation between indexing by numbers ("1") and names ("A"). As you can see there is a sort difference in the xtabs() result, so xa[2,2]=2 and not 0 as on the df_b listing.
> df_a
ItemName Feature Amount
1 First A 2
2 First B 3
3 First A 4
4 Second C 3
5 Second C 2
6 Third D 1
7 Fourth B 2
8 Fourth D 3
9 Fourth D 2
> df_b
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
> df_b$A
[1] 6 0 0 0
> xa<-xtabs(df_a$Amount~df_a$ItemName+df_a$Feature)
> xa
df_a$Feature
df_a$ItemName A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
> xa$A
Error in xa$A : $ operator is invalid for atomic vectors
Converting iteratively with for() loops is possible, but far too slow in my case because my data has millions of records.
For further processing, the output needs to be a data frame.
If anyone has solved a similar problem, please share.
You can just use as.data.frame.matrix(xa)
# output
A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
## or
df_b <- as.data.frame.matrix(xa)[unique(df_a$ItemName), ]
data.frame(ItemName = row.names(df_b), df_b, row.names = NULL)
# output
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
Without using xtabs you can do something like this:
library(dplyr)  # provides %>%
df_a %>%
dplyr::group_by(ItemName, Feature) %>%
dplyr::summarise(Sum=sum(Amount, na.rm = TRUE)) %>%
tidyr::spread(Feature, Sum, fill=0) %>%
as.data.frame()
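This reshapes the data as you require, and the result is a data.frame.
Alternatively, as.data.frame(your_xtabs_result) also returns a data frame, but in long format (one row per ItemName/Feature combination, with the value in a Freq column) rather than the wide layout of df_b.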
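The difference between the two conversions can be checked on the example data. This sketch rebuilds df_a from the question and uses the data= form of xtabs() so the dimension names are plain column names; long and wide are hypothetical names.

```r
df_a <- data.frame(
  ItemName = c("First", "First", "First", "Second", "Second",
               "Third", "Fourth", "Fourth", "Fourth"),
  Feature  = c("A", "B", "A", "C", "C", "D", "B", "D", "D"),
  Amount   = c(2, 3, 4, 3, 2, 1, 2, 3, 2)
)

xa <- xtabs(Amount ~ ItemName + Feature, data = df_a)

# Long format: one row per ItemName/Feature cell, value in Freq
long <- as.data.frame(xa)
head(long)

# Wide format: same layout as the printed table, with ItemName as row names
wide <- as.data.frame.matrix(xa)
wide$A
```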

R diagonal matrix error

I have the following type of dataframe
A B C D
1 0 1 10
0 2 1 15
1 1 0 11
I would like the following output
A B C D
1 0 1 10
1 1 0 11
0 2 1 15
I have tried this code
require(permute)
z <- apply(permute::allPerms(1:nrow(DF)), 1, function(x){
mat <- as.matrix(DF,2:ncol(DF)])
if(all(diag(mat[x,]) == rep(1,nrow(DF)))){
return(df[x,])} })
I am unable to get the desired output.
(Link for the above code- Arrange data frame in a specific way)
I request someone to guide me. The dataframe is a small sample but I have a huge one with a similar structure.
The following will work so long as there is at least one 1 in every suitable column. It's deterministic so will always just find the first 1 and swap that with the number in the diagonal position. But no combinatorial explosion. Perhaps someone can find a more elegant (or vectorised) solution???
fn<- function(colm){
i1<-match(1, colm)
colm[i1]<- colm[i]
colm[i]<-1
return(colm)
}
for(i in 1:nrow(DF))
{
DF[,i]=fn(DF[,i])
}
EDIT
Although this answer was accepted (so I cannot delete it), on rereading it I don't think it does quite what you asked.
The following code should fix this answer:
DF<-read.table(text="A B C D
13 0 0 1
1 0 1 10
0 2 1 15
1 1 0 11", header=T)
rem<-1:nrow(DF)
for(i in 1:nrow(DF))
{
temp<-DF[i,]
any1<-intersect(rem, which(DF[,i]==1))
best1<-which.min(rowSums(DF[any1,]==1))
firsti<-any1[best1]
DF[i,]<-DF[firsti,]
DF[firsti,]<-temp
rem<-setdiff(rem, i)
}
DF
A B C D
1 1 0 1 10
2 1 1 0 11
3 0 2 1 15
4 13 0 0 1
My apologies for confusion.

R: How to sum two separate values of two variables?

I have data with 7320 observations of 3 variables: two age groups and the contact number between them. Example:
ageGroup ageGroup1 mij
0 0 0.012093847617507
0 1 0.00510485237464309
0 2 0.00374919082969427
0 3 0.00307241431437433
0 4 0.00254487083293498
0 5 0.00213734013959765
0 6 0.00182565778959543
0 7 0.00159036659169942
1 0 0.00475097494199872
1 1 0.00748329237103462
1 2 0.00427123298868537
1 3 0.00319622224196792
1 4 0.00287522072903812
1 5 0.00257773394696414
1 6 0.00230322568677366
1 7 0.00205265986733139
and so on until 86. I have to calculate the mean contact number (mij) between age groups: for example, ageGroup = 0 contacts ageGroup1 = 1 with one mij, and ageGroup = 1 contacts ageGroup1 = 0 with another mij. I need to sum these two values and divide by 2 to get the average between them. Could you give me a hint on how to do that across the whole dataset?
Use ddply from plyr package (assuming your dataframe is data)
ddply(data,.(ageGroup,ageGroup1),summarize,sum.mij=sum(mij))
ageGroup ageGroup1 sum.mij
1 0 0 0.012093848
2 0 1 0.005104852
3 0 2 0.003749191
4 0 3 0.003072414
5 0 4 0.002544871
6 0 5 0.002137340
7 0 6 0.001825658
8 0 7 0.001590367
9 1 0 0.004750975
10 1 1 0.007483292
11 1 2 0.004271233
12 1 3 0.003196222
13 1 4 0.002875221
14 1 5 0.002577734
15 1 6 0.002303226
16 1 7 0.002052660
I think I see what you're trying to do here. You want to treat interactions between the two ageGroup columns as being non-directional and get the mean interaction? The code below should do this using base R functions.
Note that since the example dataset is truncated, it will only give a correct answer for the group with index 01. However if you run with the full dataset, it should work for all interactions.
# Create the data frame
df=read.table(header=T,text="
ageGroup,ageGroup1,mij
0,0,0.012093848
0,1,0.005104852
0,2,0.003749191
0,3,0.003072414
0,4,0.002544871
0,5,0.00213734
0,6,0.001825658
0,7,0.001590367
1,0,0.004750975
1,1,0.007483292
1,2,0.004271233
1,3,0.003196222
1,4,0.002875221
1,5,0.002577734
1,6,0.002303226
1,7,0.00205266
",sep=",")
df
# Using the strSort function from this SO answer:
# http://stackoverflow.com/questions/5904797/how-to-sort-letters-in-a-string-in-r
strSort <- function(x)
sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")
# Label each of the i-j interactions and j-i interactions with an index ij
# e.g. anything in ageGroup=1 interacting with ageGroup1=0 OR ageGroup=0 interacting with ageGroup1=1
# are labelled with index 01
df$ind=strSort(paste(df$ageGroup,df$ageGroup1,sep=""))
# Use the tapply function to get mean interactions for each group as suggested by Paul
tapply(df$mij,df$ind,mean)
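One caveat with pasting the two group numbers together and sorting the characters: once groups have two digits (the question goes up to 86), different pairs can collapse to the same label, e.g. groups 1 and 12 and groups 11 and 2 both reduce to the characters "1", "1", "2". A sketch that avoids this by keeping the pair as two numeric columns, using pmin() and pmax(), is shown below. The column names lo and hi and the object sym are hypothetical, and only four rows of the example data are used.

```r
df <- data.frame(
  ageGroup  = c(0, 0, 1, 1),
  ageGroup1 = c(0, 1, 0, 1),
  mij       = c(0.012093848, 0.005104852, 0.004750975, 0.007483292)
)

# Order each pair so that (0, 1) and (1, 0) get the same key
df$lo <- pmin(df$ageGroup, df$ageGroup1)
df$hi <- pmax(df$ageGroup, df$ageGroup1)

# Mean mij per unordered pair
sym <- aggregate(mij ~ lo + hi, data = df, FUN = mean)
sym
```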

Filtering data without loops in R

I have quite a big data frame (a few million records).
I need to filter it according to the following rule:
- For each product, delete all records that come before the fifth record after the first record with x>0.
So we are interested in only two columns, ID and x. The data frame is sorted by ID.
It is fairly easy to do this using loops, but loops don't perform well on such a big data frame.
How can I do it in 'vector style'?
Example:
BEFORE FILTERING
ID x
1 0
1 0
1 5 # First record with x>0
1 0
1 3
1 4
1 0
1 9
1 0 # Delete all earlier records of that product
1 0
1 6
2 0
2 1 # First record with x>0
2 0
2 4
2 5
2 8
2 0 # Delete all earlier records of that product
2 1
2 3
After filtering:
ID x
1 9
1 0
1 0
1 6
2 0
2 1
2 3
For these split, apply, combine problems - I like using plyr. There are alternatives if speed becomes an issue, but for most things - plyr is easy to understand and use. I wrote a function that implements the logic you described above and then fed that to ddply() to operate on each chunk of the data based on ID.
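fun <- function(x, column, threshold, numplus){
# Row index of the first record whose value exceeds the threshold
whichrow <- which(x[[column]] > threshold)[1]
start <- whichrow + numplus
# Assumption: if no record qualifies, or too few records follow it, drop the whole group
if (is.na(start) || start > nrow(x)) return(x[0, ])
rows <- seq(from = start, to = nrow(x))
return(x[rows,])
}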
And then feed this to ddply()
require(plyr)
ddply(dat, "ID", fun, column = "x", threshold = 0, numplus = 5)
#-----
ID x
1 1 9
2 1 0
3 1 0
4 1 6
5 2 0
6 2 1
7 2 3
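If plyr turns out to be too slow on millions of rows, the same rule can be expressed with vectorised base R. This is a sketch rather than a benchmarked solution: it uses ave() to compute, for every row, its position within the ID and the position of the first x>0 in that ID. The names pos, first, keep and filtered are hypothetical, and groups with no x>0 are assumed to be dropped entirely.

```r
# Example data from the question
dat <- data.frame(
  ID = rep(c(1, 2), times = c(11, 9)),
  x  = c(0, 0, 5, 0, 3, 4, 0, 9, 0, 0, 6,
         0, 1, 0, 4, 5, 8, 0, 1, 3)
)

# Position of each row within its ID
pos <- ave(seq_along(dat$x), dat$ID, FUN = seq_along)

# Position of the first x > 0 within each ID, repeated on every row of that ID
first <- ave(as.numeric(dat$x > 0), dat$ID,
             FUN = function(v) which(v == 1)[1])

# Keep rows from the fifth record after that first positive record onwards
keep <- !is.na(first) & pos >= first + 5
filtered <- dat[keep, ]
filtered
```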
