I have a large data.table containing many time-dependent variables (50+) for use in coxph models. This dataset was generated using tmerge. Patients are identified by the patid variable, and time intervals are defined by tstart and tstop.
The majority of the models I want to fit only use a selection of these time-dependent variables. Unfortunately the speed of Cox proportional hazards models is dependent on the number of rows and the number of timepoints in my data.table even if all the data in these rows is identical. Is there a good/fast way of combining rows which are identical apart from the time interval in order to speed up my models? In many cases, tstop for one line is equal to tstart for the next with everything else identical after removing some columns.
For example I would want to convert the data.table example into results.
library(data.table)
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
results=data.table(patid = c(1,1,2,2), tstart=c(0,2,0,1), tstop=c(2,3,1,3), x=c(0,1,1,2), y=c(0,1,2,3))
This example is extremely simplified. My current dataset has ~600k patients, >20M rows and 3.65k time points. Removing variables should significantly reduce the number of needed rows which should significantly increase the speed of models fit using a subset of variables.
The best I can come up with is:
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
example = example[order(patid,tstart),]
example[,matched:=x==shift(x,-1)&y==shift(y,-1),by="patid"]
example[is.na(matched),matched:=FALSE,by="patid"]
example[,tstop:=ifelse(matched,shift(tstop,-1),tstop)]
example[,remove:=tstop==shift(tstop),by="patid"]
example = example[is.na(remove) | remove==FALSE,]
example$matched=NULL
example$remove=NULL
This solves this example; however, the code is complex and overkill, and when the dataset has many columns, having to edit x==shift(x,-1) for each variable is asking for errors. Is there a sane way of doing this? The list of columns will change a number of times based on loops, so accepting a vector of column names to compare as input would be ideal.
This solution also doesn't cope with multiple consecutive time periods that contain the same covariate values (e.g. time periods of (0,1), (1,3), (3,4) with the same covariate values).
This solution creates a temporary group ID based on the rleid() of the combination of x and y. This temp value is used for grouping and then dropped (temp := NULL).
example[, .(tstart = min(tstart), tstop = max(tstop), x = x[1], y = y[1]),
by = .(patid, temp = rleid(paste(x,y, sep = "_")))][, temp := NULL][]
# patid tstart tstop x y
# 1: 1 0 2 0 0
# 2: 1 2 3 1 1
# 3: 2 0 1 1 2
# 4: 2 1 3 2 3
Here is an option that builds on our conversation/comments above, but allows the flexibility of passing a vector of column names:
cols=c("x","y")
cbind(
example[, id:=rleidv(.SD), .SDcols = cols][, .(tstart=min(tstart), tstop=max(tstop)), .(patid,id)],
example[,.SD[1],.(patid,id),.SDcols =cols][,..cols]
)[,id:=NULL][]
Output:
patid tstart tstop x y
1: 1 0 2 0 0
2: 1 2 3 1 1
3: 2 0 1 1 2
4: 2 1 3 2 3
Based on Wimpel's answer I have created the following solution which also allows using a vector of column names for input.
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
variables = c("x","y")
example[,key_ := do.call(paste, c(.SD,sep = "_")),.SDcols = variables]
example[, c("tstart", "tstop") := .(min(tstart),max(tstop)),
by = .(patid, temp = rleid(key_))][,key_:=NULL]
example = unique(example)
I would imagine this could be simplified, but I think it does what is needed for more complex examples.
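The approaches above can be wrapped into one reusable function. The following is only a sketch: collapse_intervals and its arguments are hypothetical names, and it assumes that the intervals within a patient are contiguous (tstop of one row equals tstart of the next), as in the example. It also handles runs of more than two identical rows.

```r
library(data.table)

# Hypothetical helper (not part of data.table or survival): merge consecutive
# rows per patient whose values in `cols` are identical.
collapse_intervals <- function(dt, cols, id = "patid") {
  dt <- copy(dt)                      # do not modify the caller's table by reference
  setorderv(dt, c(id, "tstart"))      # runs only make sense in time order
  # A new run starts whenever the patient or any compared column changes
  dt[, run_ := rleidv(.SD), .SDcols = c(id, cols)]
  grp <- c(id, "run_")
  out <- dt[,
    c(list(tstart = tstart[1L], tstop = tstop[.N]),  # span of the whole run
      lapply(.SD, `[`, 1L)),                         # covariates are constant within a run
    by = grp, .SDcols = cols]
  out[, run_ := NULL]
  out[]
}

example <- data.table(patid = c(1,1,1,2,2,2), tstart = c(0,1,2,0,1,2),
                      tstop = c(1,2,3,1,2,3), x = c(0,0,1,1,2,2), y = c(0,0,1,2,3,3))
collapsed <- collapse_intervals(example, c("x", "y"))
```

For a coxph model, the event indicator would presumably need to be included in cols as well, so that an interval ending in an event is not merged into a neighbouring interval without one.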
Related
I have a dataset (named 'gala') that has the columns "Day", "Tree", "Trt", and "Countable". The data was collected over time, so each numbered tree within a treatment is the same tree across all days. The tree numbers are repeated for each treatment (e.g. there is a tree "1" for multiple treatments). I want to know the proportion/frequency of the "Countable" column values. I have converted the values in the "Countable" column to binomial ("0" and "1").
I would like to compute the relative frequency of "1" vs. "0" for the 'Countable' column, for each tree per each treatment per each day (e.g. If I had eight 1's and two 0's, the new column value would be "0.8" to summarize with one value that tree for that treatment on that day), and output these results into a new data frame that also includes the original day, Tree, Trt values.
I have been unsuccessfully trying to make a Frankenstein of codes from other Stack Overflow answers, but I cannot get the codes to work. Many people use "sum" but I do not want the sum, I would just like R to treat the "0" and "1" like categorical values and give me the relative proportion of each for each subset of data. If I missed this, I am sorry, and please let me know with a link to this answer. I am new to coding, and R, and do not understand well how other codes not directly relating to what I would like to do can be applied.
It looks like dplyr is probably my best option, based on what I've seen for other similar questions. This is what I have thus far, but I keep getting various errors:
library(dplyr)
RelativeFreq <-
(gala %>%
group_by(Day, Tree, Trt) %>%
summarise(Countable) %>%
mutate(rel.freq=n/length(Countable)))
I've also tried this with no success:
RelativeFreq <- gala[,.("proportion"=frequency(Countable[0,1])), by=c("Day","Tree","Trt")]
Any help is greatly appreciated. Thank you!
You could use data.table:
# create fake data
set.seed(0)
df <- expand.grid(Day = 1:2,
Tree = 1:2,
Trt = 1:2)
df<- rbind(df, df, df)
library(data.table)
# make df a data.table
setDT(df)
# create fake Countable column
df[, Countable := as.integer(runif(.N) < 0.5)]
RelativeFreq <- df[, list(prop = sum(Countable)/.N), by = list(Day, Tree, Trt)]
RelativeFreq
Day Tree Trt prop
1: 1 1 1 0.3333333
2: 2 1 1 0.3333333
3: 1 2 1 0.6666667
4: 2 2 1 0.6666667
5: 1 1 2 0.3333333
6: 2 1 2 0.3333333
7: 1 2 2 0.6666667
8: 2 2 2 0.0000000
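Since the question asked about dplyr, here is a rough equivalent as a sketch. The gala data below is made up for illustration, and it relies on the fact that the mean of a 0/1 column is the proportion of 1s.

```r
library(dplyr)

# Made-up stand-in for the asker's 'gala' data (values are illustrative only)
gala <- data.frame(
  Day       = c(1, 1, 1, 1, 2, 2),
  Tree      = c(1, 1, 1, 1, 1, 1),
  Trt       = c("A", "A", "A", "A", "A", "A"),
  Countable = c(1, 1, 1, 0, 0, 1)
)

RelativeFreq <- gala %>%
  group_by(Day, Tree, Trt) %>%
  # mean of a 0/1 vector = share of 1s within each Day/Tree/Trt group
  summarise(rel.freq = mean(Countable), .groups = "drop")
```

The result has one row per Day/Tree/Trt combination and keeps those three columns alongside rel.freq.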
I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
Count <- Count + 1
respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
else {
assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
}
}
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID) as well as a counter (Count) for the total number unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and, if the value is not NA, check whether DRUGID_1 is in the list respDRUGID. If it is not, create a new column variable 'r.DRUGID', set its value to 1, and increase the unique count by 1. Otherwise, the value of DRUGID_1 is already in respDRUGID, so just set r.DRUGID = 1.
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and required result format, using the tidyverse packages:
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
gather(drug_df, key = "column", value = "DRUGID", -patient, na.rm = TRUE) %>%
arrange(patient, DRUGID) %>%
group_by(patient) %>%
summarize(DRUGIDs = paste(DRUGID, collapse=","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99 whose post helped think about the problem in another way.
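For the binary indicator columns described in the question, a base R sketch along the following lines may also work. The data is the toy drug_df from the answer above (not DAWN data), and the r.<id> column names follow the naming the question proposes. The result keeps one row per patient, so it can still be merged with other per-patient variables.

```r
# Toy data matching the answer's example; NA means no drug in that slot
drug_df <- data.frame(
  patient  = c("A", "B", "C", "D"),
  DRUGID_1 = c(1, 2, 2, 3),
  DRUGID_2 = c(2, NA, 1, 1),
  DRUGID_3 = c(3, NA, NA, 2)
)

drug_cols <- grep("^DRUGID_", names(drug_df), value = TRUE)
# All distinct drug codes mentioned anywhere; sort() drops the NAs
drug_ids <- sort(unique(unlist(drug_df[drug_cols])))

# One 0/1 column per drug: 1 if the code appears in any DRUGID_* column
indicators <- sapply(drug_ids, function(id) {
  as.integer(rowSums(drug_df[drug_cols] == id, na.rm = TRUE) > 0)
})
colnames(indicators) <- paste0("r.", drug_ids)

result <- cbind(drug_df, indicators)
```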
I have a smallish (2k) data set that contains questionnaire answers filled out by students who were sampled twice a year. Not all the students who were present for the first wave were there for the second wave, and vice versa. For each student, a unique ID was created that consists of the school code, the class code, the student number, and the wave as a decimal. For example, 100612.1 is a student from school 10, grade 6, number 12 on the names list, and this was the first wave. The idea behind the decimal was to provide a way to identify the same student again in the data set (the only value that differs by less than 1 in absolute value from a given ID is the same student in the other wave). At least that was the idea.
I was thinking of a script that would do the following:
- find the rows whose unique IDs differ from one another by less than 1 in absolute value
- for those rows, generate a new row (in a new table) that consists of the student ID and the delta of the measured variables (i.e. value in wave 2 - value in wave 1).
I am new to R, but I have a tiny bit of background in other OOP languages. I thought about creating a for loop that runs from 1 to length(df) and just looks for its "brother". My gut feeling tells me that this is not the way things are done in R. Any ideas?
All I need is a quick way of sifting through the data looking for the second-wave row. I think the rest should be straightforward from there.
Thank you for helping.
PS. Since this is my first post here, I apologize beforehand for any wrongdoings in this post... :)
The question alludes to data.table, so here is a way to adapt @jed's answer using that package.
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
Example data as before, now instead of data.frame and tapply you can do this:
library(data.table)
surveyDT <- data.table(ids, answers)
surveyDT[, `:=` (child = substr(ids, 1, 6), wave = substr(ids, 8, 8))] # split ID's
# note multiple assign-by-reference := syntax above
setkey(surveyDT, child, wave) # order data
# calculate delta on keyed data, grouping by child
surveyDT[, delta := diff(answers), by = child]
unique(surveyDT[, delta, by = child]) # list results
child delta
1: 100612 -1
2: 100613 1
3: 110714 NA
4: 201802 NA
To remove rows with NA values for delta:
unique(surveyDT[, .SD[(!is.na(delta))], by = child])
child ids answers wave delta
1: 100612 100612.1 5 1 -1
2: 100613 100613.1 3 1 1
Use .SDcols to output only specific columns (in addition to the by columns), for example,
unique(surveyDT[, .SD[(!is.na(delta))], by = child, .SDcols = 'delta'])
child delta
1: 100612 -1
2: 100613 1
It took me some time to get acquainted with data.table syntax, but now I find it more intuitive, and it's fast for big data.
There are two ways that come to mind. The easiest is to use the function floor(), which rounds a number down to the nearest integer. For example:
floor(100612.1)
#[1] 100612
floor(9.9)
#[1] 9
Alternatively, you could write a fairly simple regular expression to get rid of the decimal place too. Then you can use unique() to find the rows that are or are not duplicated entries.
Let's make some fake data so we can see our problem easily:
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
survey <- data.frame(ids,answers)
Now let's split our ids into two different columns:
survey$child_id <- substr(survey$ids,1,6)
survey$wave_id <- substr(survey$ids,8,8)
Then we'll order by child and wave, and compute differences based on child:
survey <- survey[order(survey$child_id, survey$wave_id),]
survey$delta <- unlist(tapply(survey$answers, survey$child_id, function(x) c(NA,diff(x))))
Output:
ids answers child_id wave_id delta
1 100612.1 5 100612 1 NA
2 100612.2 4 100612 2 -1
3 100613.1 3 100613 1 NA
4 100613.2 4 100613 2 1
5 110714.1 1 110714 1 NA
6 201802.2 0 201802 2 NA
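If only students present in both waves are wanted, one row each, a merge of the two waves is another option. This is a sketch on the same fake data; the column names child, wave, and delta are illustrative, and the wave is recovered from the decimal part of the ID.

```r
ids <- c(100612.1, 100612.2, 100613.1, 100613.2, 110714.1, 201802.2)
answers <- c(5, 4, 3, 4, 1, 0)

survey <- data.frame(
  child   = floor(ids),                       # student identifier without the wave
  wave    = round((ids - floor(ids)) * 10),   # decimal part .1 / .2 -> 1 / 2
  answers = answers
)

wave1 <- survey[survey$wave == 1, c("child", "answers")]
wave2 <- survey[survey$wave == 2, c("child", "answers")]

# Inner join: keeps only students who appear in both waves
both <- merge(wave1, wave2, by = "child", suffixes = c(".w1", ".w2"))
both$delta <- both$answers.w2 - both$answers.w1   # wave 2 minus wave 1
```

Students who appear in only one wave (110714 and 201802 here) drop out of the merge, so no NA handling is needed afterwards.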
I have a dataset (df) for which I would like to get some summary stats for entire column variables, and then a summary of the variables for 2 specific treatments. So far so good:
summary(var1)
aggregate(var1 ~ treatment, results, summary)
I then have one variable whose values are 1 and 2. I can count these with the sum function:
sum(var3 == 1)
sum(var3 == 2)
However, when I try to sum these by treatment:
aggregate(var3 ~ treatment, results, sum var3 == 1)
I get the following error:
Error in sum == 1 :
comparison (1) is possible only for atomic and list types
I have tried lots of variations on the same theme and taken a look through the textbooks I am using to help me with my first forays into R... but I can't seem to find the answer.
Here's a sample dataset (it's always best to include sample data to make your question reproducible).
set.seed(15)
results<-data.frame(
var1=runif(30),
var3=sample(1:2, 30, replace=T),
treatment=gl(2,15)
)
If you really want to use aggregate, you can do
aggregate(var3==1~treatment, results, sum)
# treatment var3 == 1
# 1 1 9
# 2 2 5
but since you're counting discrete observations, table() may be a better choice to do all the counting at once
with(results, table(var3, treatment))
# treatment
# var3 1 2
# 1 9 5
# 2 6 10
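As for the error in the question: the third argument of aggregate() has to be a function, so the comparison can also go inside an anonymous function. A small sketch with made-up data (not the seeded sample above):

```r
# Made-up data: three observations per treatment
results <- data.frame(
  var3      = c(1, 2, 1, 1, 2, 2),
  treatment = factor(c(1, 1, 1, 2, 2, 2))
)

# FUN receives the var3 values of one treatment at a time
counts <- aggregate(var3 ~ treatment, results, function(v) sum(v == 1))
```

Here counts has one row per treatment, with the var3 column holding the number of 1s in that treatment.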
I have the following code, which does what I want. But I would like to know if there is a simpler/nicer way of getting there?
The overall aim of me doing this is that I am building a separate summary table for the overall data, so the average which comes out of this will go into that summary.
Test <- data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Thing = c("Apple","Apple","Pear","Pear","Apple","Apple","Kiwi","Apple","Pear"),
Day = c("Mon","Tue","Wed")
)
library(reshape2) # provides dcast()
countfruit <- function(data){
df <- as.data.frame(table(data$ID,data$Thing))
df <- dcast(df, Var1 ~ Var2)
colnames(df) = c("ID", "Apple","Kiwi", "Pear")
#fixing the counts to apply a 1 for if there is any count there:
df$Apple[df$Apple>0] = 1
df$Kiwi[df$Kiwi>0] = 1
df$Pear[df$Pear>0] = 1
#making a new column in the summary table of how many for each person
df$number <- rowSums(df[2:4])
return(mean(df$number))}
result <- countfruit(Test)
I think you overcomplicate the problem. Here is a shorter version keeping the same rationale.
df <- table(data$ID,data$Thing)
mean(rowSums(df>0)) ## mean number of nonzero entries per row (i.e. per ID)
EDIT: one-liner solution:
with(Test , mean(rowSums(table(ID,Thing)>0)))
It looks like you are trying to count how many nonzero entries there are in each row. If so, either use as.logical, which will convert any nonzero number to TRUE (aka 1), or just count the number of zeros in a row and subtract from the number of pertinent columns.
For example, if I followed your code correctly, your dataframe is
Var1 Apple Kiwi Pear
1 1 2 0 1
2 2 2 0 1
3 3 1 1 1
So, (ncol(df)-1) - sum(df[1,-1]==0) gives you the count for the first row.
Alternatively, use as.logical to convert all nonzero values to TRUE aka 1 and calculate the rowSums over the columns of interest.
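A minimal sketch of the as.logical route, assuming the data frame shown above (the first column holds the ID and is excluded from the count):

```r
# The casted count table from the example above
df <- data.frame(
  Var1  = c(1, 2, 3),
  Apple = c(2, 2, 1),
  Kiwi  = c(0, 0, 1),
  Pear  = c(1, 1, 1)
)

# as.logical() turns every nonzero count into TRUE; rowSums() then counts the TRUEs
distinct_per_id <- rowSums(sapply(df[-1], as.logical))
avg_distinct <- mean(distinct_per_id)
```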