How do you prepare longitudinal data for survival analysis with various specifications? - r

I have a question regarding longitudinal study analysis and work with R.
I have the following data format:
ID Visit Behaviour Distance_to_first_visit_in_month
1 0 1 0
1 1 1 6
1 2 1 12
1 3 1 50
2 0 3 0
2 1 3 8
2 2 3 16
2 3 3 25
2 4 3 40
2 5 3 60
3 0 1 0
3 1 1 6
3 2 1 12
3 3 3 24
3 4 3 30
3 5 3 55
I need the data in the following format:
ID Visit Behaviour Distance_to_first_visit_in_month Status
1 0 1 0 0
2 0 3 0 1
3 3 3 24 1
If a person has Behaviour 1 at every visit until the end, he should simply be censored, because the study has finished. If a person has Behaviour 3 for the first time, I need the Distance_to_first_visit_in_month at that visit, because that is where he gets status 1 in the Kaplan-Meier curve.
I tried to filter for the maximal Distance_to_first_visit_in_month and take the Behaviour there. When I bring the data into wide format it is easy to get those. But I can't get the Distance_to_first_visit_in_month when the person already has 3 as Behaviour at the beginning, or in the other cases.
I have 300 IDs with sometimes 11 visits, so I can't prepare the data manually.
Do you have an idea?
Thank you in advance.
Best Christina

As you don't explain how to aggregate your data to the second dataset, I can only show you how to get the ID's that match your conditions and how to implement the status variable. See this example:
library(dplyr)
# get IDs whose Behaviour is 1 at every visit
id_list1 <- lapply(df %>% split(.$ID), function(x) {
  if (all(x$Behaviour == 1)) {
    return(unique(x$ID))
  }
}) %>%
  unlist()
# get IDs with 3 as first value (Behaviour at Visit 0)
id_list3 <- lapply(df %>% split(.$ID), function(x) {
  if (x[x$Visit == 0, "Behaviour"] == 3) {
    return(unique(x$ID))
  }
}) %>%
  unlist()
# Status is 1 for IDs that start with 3; new_dist keeps the distance for all other IDs
df %>%
  mutate(Status = ifelse(ID %in% id_list3, 1, 0)) %>%
  mutate(new_dist = ifelse(!ID %in% id_list3, Distance_to_first_visit_in_month, NA))
Please note that you'll get named vectors in id_list1 and id_list3. There are no duplicates; each element's name simply matches its value.
By "at the beginning", do you mean Visit number 0? If not, you'll have to adjust x$Visit==0.
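For completeness, here is a minimal base R sketch of the aggregation step itself. It assumes that the event time is the first visit with Behaviour 3, and that people who never show 3 are censored at their last visit (the example in the question shows Visit 0 for ID 1, so adjust this if that is really what is wanted). The names surv_rows and surv_df are made up for illustration.

```r
# Example data copied from the question
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  Visit = c(0:3, 0:5, 0:5),
  Behaviour = c(1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3, 3),
  Distance_to_first_visit_in_month =
    c(0, 6, 12, 50, 0, 8, 16, 25, 40, 60, 0, 6, 12, 24, 30, 55)
)

# One row per ID: first visit with Behaviour 3 (Status 1),
# otherwise the last visit (Status 0, assumed censoring time)
surv_rows <- lapply(split(df, df$ID), function(x) {
  x <- x[order(x$Visit), ]
  event_idx <- which(x$Behaviour == 3)
  if (length(event_idx) > 0) {
    row <- x[event_idx[1], ]
    row$Status <- 1
  } else {
    row <- x[nrow(x), ]
    row$Status <- 0
  }
  row
})
surv_df <- do.call(rbind, surv_rows)
surv_df
```

The Distance_to_first_visit_in_month and Status columns of surv_df could then be passed to survival::Surv() as time and event.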

Related

Sort across rows to obtain three largest values

There is an injury score called the ISS score.
I have a table of injury data in rows according to pt ID.
I would like to obtain the top three values for the 6 injury columns.
Column values range from 0-5.
pt_id head face abdo pelvis Extremity External
1 4 0 0 1 0 3
2 3 3 5 0 3 2
3 0 0 2 1 1 1
4 2 0 0 0 0 1
5 5 0 0 2 0 1
My output for the above example would be
pt-id n1 n2 n3
1 4 3 1
2 5 3 3
3 2 1 1
4 2 1 0
5 5 2 1
values can be in a list or in new columns as calculating the score is simple from that point on.
I had thought that I would be able to create a list for the 6 injury columns and then apply a sort to each list taking the top three values. My code for that was:
ais$ais_list <- setNames(split(ais[,2:7], seq(nrow(ais))), rownames(ais))
But I struggled to apply the sort to the lists within the data frame as unfortunately some of the data in my data set includes NA values
We could use apply row-wise and sort the dataframe and take only first three values in each row.
cbind(df[1], t(apply(df[-1], 1, sort, decreasing = TRUE)[1:3, ]))
# pt_id 1 2 3
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1
As some values may be NA, it is better to apply sort inside an anonymous function and then take the top 3 values using head.
cbind(df[1], t(apply(df[-1], 1, function(x) head(sort(x, decreasing = TRUE), 3))))
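One thing to watch with NA values: sort() drops them by default, so a row with fewer than three non-missing scores would return a shorter vector. A possible variant, sketched here on made-up data (the ais values and the n1 to n3 column names are only for illustration), keeps NAs at the end with na.last = TRUE so every row yields exactly three values:

```r
# Hypothetical data with missing scores
ais <- data.frame(
  pt_id = 1:3,
  head = c(4, NA, 0), face = c(0, 3, NA), abdo = c(0, 5, 2),
  pelvis = c(1, NA, NA), Extremity = c(0, 3, NA), External = c(3, 2, NA)
)

# Sort each row in decreasing order, keep NAs last, take three values
top3 <- t(apply(ais[-1], 1, function(x) {
  sort(x, decreasing = TRUE, na.last = TRUE)[1:3]
}))
colnames(top3) <- c("n1", "n2", "n3")
res <- cbind(ais[1], top3)
res
```

Rows with fewer than three recorded scores then end up with NA in the trailing columns, which has to be handled when the score is calculated.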
A tidyverse option is to first gather the data, arrange it in descending order and for every row select only first three values. We then replace the injury column with the column names which we want and finally spread the data back to wide format.
library(tidyverse)
df %>%
  gather(injury, value, -pt_id) %>%
  arrange(desc(value)) %>%
  group_by(pt_id) %>%
  slice(1:3) %>%
  mutate(injury = 1:3) %>%
  spread(injury, value)
# pt_id `1` `2` `3`
# <int> <int> <int> <int>
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1

Panel Data in R: Get complete cases of data based on individuals

I'm working on an unbalanced panel dataset. Data came from a game and for every user (user_id) in the record I have data for every level (level) of the game. As recording data started some time after introduction of the game, for some users I don't have data regarding the first levels, that's why I want to throw them out in a first step.
I've tried the complete.cases-function, but it only excludes the rows with the missing values (NAs), but not data for the whole user with missing values in level 1.
panel <- panel[complete.cases(panel), ]
That's why I need a code that excludes every user who has no record in level 1 (which in my dataset means he has an "NA" at one of the dependent variables, i.e. number of activities).
Update #1:
Data looks like this (thanks to thc):
> game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=c(0,150,170,80,100,110,75,100,0))
> game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I now want to exclude data from player 1, because he has a score of 0 in level 1.
Here is one approach
Example data:
game_data <- data.frame(player = c(1,1,2,2,2,3,3,3), level = c(2,3,1,2,3,1,2,3), score=sample(100, 8))
> game_data
player level score
1 1 2 19
2 1 3 13
3 2 1 65
4 2 2 32
5 2 3 22
6 3 1 98
7 3 2 58
8 3 3 84
library(dplyr)
game_data %>% group_by(player) %>% filter(any(level == 1)) %>% as.data.frame
player level score
1 2 1 65
2 2 2 32
3 2 3 22
4 3 1 98
5 3 2 58
6 3 3 84
I think I have now found a solution with your help:
game_data %>% group_by(player) %>% filter(any(level == 1 & score > 0)) %>% as.data.frame
This seems to work; I just needed a small adjustment to your code, thc. Thank you very much for your help!
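If the missing level 1 data shows up as NA rather than 0 (as in the original description), the same idea can be sketched in base R. The data and the names ok_players and complete_panel below are invented for illustration: player 1 has an NA at level 1 and player 3 has no level 1 row at all.

```r
# Hypothetical panel: player 1 has NA at level 1, player 3 lacks level 1
game_data <- data.frame(
  player = c(1, 1, 1, 2, 2, 2, 3, 3),
  level  = c(1, 2, 3, 1, 2, 3, 2, 3),
  score  = c(NA, 150, 170, 80, 100, 110, 100, 90)
)

# Players that have a level 1 row with a non-missing score
ok_players <- unique(
  game_data$player[game_data$level == 1 & !is.na(game_data$score)]
)

# Keep all rows of those players, drop everyone else entirely
complete_panel <- game_data[game_data$player %in% ok_players, ]
complete_panel
```

In this example only player 2 remains.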

Add missing values in time series efficiently

I have 500 datasets (panel data). In each I have a time series (week) across different shops (store). Within each shop, I would need to add missing time series observations.
A sample of my data would be:
store week value
1 1 50
1 3 52
1 4 10
2 1 4
2 4 84
2 5 2
which I would like to look like:
store week value
1 1 50
1 2 0
1 3 52
1 4 10
2 1 4
2 2 0
2 3 0
2 4 84
2 5 2
I currently use the following code (which works, but takes very very long on my data):
stores <- unique(mydata$store)
for (i in 1:length(stores)) {
  mydata <- merge(
    expand.grid(week = min(mydata$week):max(mydata$week)),
    mydata, all = TRUE)
  mydata[is.na(mydata)] <- 0
}
Are there better and more efficient ways to do so?
Here's a dplyr/tidyr option you could try:
library(dplyr); library(tidyr)
group_by(df, store) %>%
  complete(week = full_seq(week, 1L), fill = list(value = 0))
#Source: local data frame [9 x 3]
#
# store week value
# (int) (int) (dbl)
#1 1 1 50
#2 1 2 0
#3 1 3 52
#4 1 4 10
#5 2 1 4
#6 2 2 0
#7 2 3 0
#8 2 4 84
#9 2 5 2
By default, if you don't specify the fill parameter, new rows will be filled with NA. Since you seem to have many other columns, I would advise to leave out the fill parameter so you end up with NAs, and if required, make another step with mutate_each to turn NAs into 0 (if that's appropriate).
group_by(df, store) %>%
  complete(week = full_seq(week, 1L)) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)), -store, -week)
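mutate_each() and funs() have since been deprecated in dplyr. A rough equivalent with across(), assuming dplyr 1.0 or later and using the sample data from the question (the name filled is just for illustration), might look like this:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  store = c(1, 1, 1, 2, 2, 2),
  week  = c(1, 3, 4, 1, 4, 5),
  value = c(50, 52, 10, 4, 84, 2)
)

filled <- df %>%
  group_by(store) %>%
  # add the missing weeks within each store (new rows get NA)
  complete(week = full_seq(week, 1)) %>%
  ungroup() %>%
  # turn NA into 0 in every column except store and week
  mutate(across(-c(store, week), ~ replace(.x, is.na(.x), 0)))
filled
```

Whether 0 is the right replacement for every additional column depends on the data, so the across() selection may need narrowing.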

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1:mode df2:sex
1 1
2 2
3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr obtaining all combinations (with not existent ones=0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left_join (left_join(df1,df3)) you recover the modes not in df3, but 'sex' appears as NA, and the same happens with mode if you do left_join(df2,df3).
So how can you do both left join to recover all absent combinations, with cases=0? dplyr preferred, but sqldf an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
  group_by(mode, sex) %>%
  summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First, here's your data in a more friendly, reproducible format
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then I left join that to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
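As a side note, a base R alternative that is not part of either answer above: xtabs() sums a column over all combinations of factor levels, including combinations that never occur. The sketch below assumes the full set of levels comes from df1 and df2; the name res is only for illustration.

```r
df1 <- data.frame(mode = 1:3)
df2 <- data.frame(sex = 1:2)
df3 <- data.frame(mode = c(1, 1, 2, 3, 1), sex = c(1, 1, 2, 1, 2),
                  cases = c(9, 2, 7, 2, 5))

# Use df1/df2 to define the levels, so absent combinations are kept
df3$mode <- factor(df3$mode, levels = df1$mode)
df3$sex  <- factor(df3$sex,  levels = df2$sex)

# Sum cases per mode/sex cell; empty cells become 0
res <- as.data.frame(xtabs(cases ~ mode + sex, data = df3))
names(res)[3] <- "cases"
res
```

Note that mode and sex come back as factors and the rows are ordered with mode varying fastest, so an extra sort may be needed to match the layout above.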

R saving the output of table() into a data frame

I have the following data frame:
id<-c(1,2,3,4,1,1,2,3,4,4,2,2)
period<-c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df<-data.frame(id,period)
typing
table(df)
results in
period
id calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
however if I save it as a data frame 'df'
df<-data.frame(table(df))
the format of 'df' would be like
id period Freq
1 1 calib 1
2 2 calib 2
3 3 calib 0
4 4 calib 1
5 1 first 2
6 2 first 0
7 3 first 0
8 4 first 1
9 1 valid 0
10 2 valid 2
11 3 valid 2
12 4 valid 1
how can I avoid this and how can I save the first output as it is into a data frame?
more importantly is there any way to get the same result using 'dcast'?
Would this help?
> data.frame(unclass(table(df)))
calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
To elaborate just a little bit. I've changed the ids in the example data.frame such that your ids are not 1:4, in order to prove that the ids are carried along into the table and are not a sequence of row counts.
id <- c(10,20,30,40,10,10,20,30,40,40,20,20)
period <- c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df <- data.frame(id,period)
Create the new data.frame one of two ways. rengis's answer is fine for 2-column data frames that have the id column first. It won't work so well if your data frame has more than 2 columns, or if the columns are in a different order.
Alternative would be to specify the columns and column order for your table:
df3 <- data.frame(unclass(table(df$id, df$period)))
The id column is included in the new data.frame as row.names(df3). To add it as a new column:
df3$id <- row.names(df3)
df3
calib first valid id
10 1 2 0 10
20 2 0 2 20
30 0 0 2 30
40 1 1 1 40
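Another base R option, sketched here under the same example data, is as.data.frame.matrix(), which converts the two-way table directly and keeps the period names as column names. The name wide is just for illustration.

```r
id <- c(10, 20, 30, 40, 10, 10, 20, 30, 40, 40, 20, 20)
period <- c("first", "calib", "valid", "valid", "calib", "first",
            "valid", "valid", "calib", "first", "calib", "valid")
df <- data.frame(id, period)

# Convert the id x period contingency table to a wide data frame
wide <- as.data.frame.matrix(table(df$id, df$period))

# The ids are stored as row names; copy them into a regular column
wide$id <- as.numeric(rownames(wide))
wide
```

On the dcast part of the question: reshape2::dcast(df, id ~ period, fun.aggregate = length) should give the same counts with id as a regular column, assuming the reshape2 package is available.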
