I have a large data.table containing many time-dependent variables (50+) for use in coxph models. This dataset was generated using tmerge. Patients are identified by the patid variable, and time intervals are defined by tstart and tstop.
The majority of the models I want to fit only use a selection of these time-dependent variables. Unfortunately the speed of Cox proportional hazards models is dependent on the number of rows and the number of timepoints in my data.table even if all the data in these rows is identical. Is there a good/fast way of combining rows which are identical apart from the time interval in order to speed up my models? In many cases, tstop for one line is equal to tstart for the next with everything else identical after removing some columns.
For example, I would want to convert the data.table example below into results.
library(data.table)
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
results=data.table(patid = c(1,1,2,2), tstart=c(0,2,0,1), tstop=c(2,3,1,3), x=c(0,1,1,2), y=c(0,1,2,3))
This example is extremely simplified. My current dataset has ~600k patients, >20M rows and 3.65k time points. Removing variables should significantly reduce the number of needed rows which should significantly increase the speed of models fit using a subset of variables.
The best I can come up with is:
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
example = example[order(patid,tstart),]
example[,matched:=x==shift(x,-1)&y==shift(y,-1),by="patid"]
example[is.na(matched),matched:=FALSE,by="patid"]
example[,tstop:=ifelse(matched,shift(tstop,-1),tstop)]
example[,remove:=tstop==shift(tstop),by="patid"]
example = example[is.na(remove) | remove==FALSE,]
example$matched=NULL
example$remove=NULL
This solves this example; however, the code is fairly complex, and with many columns in the dataset, having to edit x==shift(x,-1) for each variable is error-prone. Is there a sane way of doing this? The list of columns will change a number of times based on loops, so accepting a vector of column names to compare as input would be ideal.
This solution also doesn't cope with multiple time periods in a row that contain the same covariate values (e.g. time periods of (0,1), (1,3), (3,4) with the same covariate values).
This solution creates a temporary group id based on the rleid() of the combination of x and y. This temp value is used and then dropped (temp := NULL).
example[, .(tstart = min(tstart), tstop = max(tstop), x = x[1], y = y[1]),
by = .(patid, temp = rleid(paste(x,y, sep = "_")))][, temp := NULL][]
# patid tstart tstop x y
# 1: 1 0 2 0 0
# 2: 1 2 3 1 1
# 3: 2 0 1 1 2
# 4: 2 1 3 2 3
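To make the grouping step concrete, here is a small sketch of what rleid() returns for the example columns. Passing x and y as separate arguments instead of pasting them into one string is a variation of mine, not part of the answer above; in the answer the grouping is additionally done by patid, so runs never span two patients.

```r
library(data.table)

x <- c(0, 0, 1, 1, 2, 2)
y <- c(0, 0, 1, 2, 3, 3)

# rleid() numbers consecutive runs: the id increases by one every time
# the value (or, with several vectors, any of the values) changes.
run_x  <- rleid(x)      # runs of x alone
run_xy <- rleid(x, y)   # runs of the (x, y) combination

print(run_x)   # 1 1 2 2 3 3
print(run_xy)  # 1 1 2 3 4 4
```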
Here is an option that builds on our conversation/comments above, but allows the flexibility of passing a vector of column names:
cols=c("x","y")
cbind(
example[, id:=rleidv(.SD), .SDcols = cols][, .(tstart=min(tstart), tstop=max(tstop)), .(patid,id)],
example[,.SD[1],.(patid,id),.SDcols =cols][,..cols]
)[,id:=NULL][]
Output:
patid tstart tstop x y
1: 1 0 2 0 0
2: 1 2 3 1 1
3: 2 0 1 1 2
4: 2 1 3 2 3
Based on Wimpel's answer I have created the following solution which also allows using a vector of column names for input.
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
variables = c("x","y")
example[,key_ := do.call(paste, c(.SD,sep = "_")),.SDcols = variables]
example[, c("tstart", "tstop") := .(min(tstart),max(tstop)),
by = .(patid, temp = rleid(key_))][,key_:=NULL]
example = unique(example)
I would imagine this could be simplified, but I think it does what is needed for more complex examples.
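The approaches above could perhaps be wrapped into one reusable function. The following is only a sketch: the name collapse_intervals and its arguments are hypothetical, it assumes the intervals of each patient are contiguous (no gaps), and it does not handle an event/status column, which for coxph would presumably have to be taken from the last row of each run.

```r
library(data.table)

# Hypothetical helper (illustrative name and arguments, not from the answers):
# collapse consecutive rows of a patient whose values in `cols` are identical.
collapse_intervals <- function(dt, cols, id = "patid") {
  dt <- copy(dt)                      # do not modify the caller's table by reference
  setorderv(dt, c(id, "tstart"))      # runs only make sense in time order
  key_cols   <- c(id, cols)
  group_cols <- c(id, "run_")
  # the run id changes whenever the patient or any column in `cols` changes
  dt[, run_ := rleidv(.SD), .SDcols = key_cols]
  out <- dt[, c(list(tstart = min(tstart), tstop = max(tstop)),
                lapply(.SD, function(v) v[1L])),
            by = group_cols, .SDcols = cols]
  out[, run_ := NULL][]
}

example <- data.table(patid = c(1,1,1,2,2,2), tstart = c(0,1,2,0,1,2),
                      tstop = c(1,2,3,1,2,3), x = c(0,0,1,1,2,2), y = c(0,0,1,2,3,3))
collapsed <- collapse_intervals(example, c("x", "y"))
print(collapsed)

# Three consecutive periods with identical covariates collapse into one row
long_run <- data.table(patid = 1, tstart = c(0, 1, 3), tstop = c(1, 3, 4), x = 0, y = 0)
print(collapse_intervals(long_run, c("x", "y")))
```

Because the function works on a copy, the original table keeps all its rows and can be collapsed again with a different set of columns inside a loop.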
I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
Count <- Count + 1
respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
else {
assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
}
}
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID) as well as a counter (Count) for the total number unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and if not NA, then if DRUGID_1 is not in list respDRUGID then create a new column variable 'r.DRUGID' and set value to 1. Also increase the unique count by 1. Otherwise the value of DRUGID_1 is already in list respDRUGID then set r.DRUGID = 1
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and required result format. Using package tidyverse
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
gather(drug_df, key = "column", value = "DRUGID", -patient, na.rm = TRUE) %>%
arrange(patient, DRUGID) %>%
group_by(patient) %>%
summarize(DRUGIDs = paste(DRUGID, collapse=","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99, whose post helped me think about the problem in another way.
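For reference, the binary indicator columns described in the question can also be built directly in base R, keeping one row per patient so that a later merge with the weight variable stays aligned. This is a minimal sketch on made-up data; the column prefix r. follows the question, everything else (data values, variable names) is an assumption.

```r
# Made-up data in the same shape as the guess above (NA = no drug mentioned)
drug_df <- data.frame(
  patient  = c("A", "B", "C", "D"),
  DRUGID_1 = c(1, 2, 2, 3),
  DRUGID_2 = c(2, NA, 1, 1),
  DRUGID_3 = c(3, NA, NA, 2)
)

id_cols <- grep("^DRUGID_", names(drug_df), value = TRUE)
ids     <- as.matrix(drug_df[id_cols])
codes   <- sort(unique(ids[!is.na(ids)]))   # every drug code that occurs anywhere

# One 0/1 column per code: 1 if the code appears in any DRUGID_* column of the row
indicators <- vapply(
  codes,
  function(code) as.integer(rowSums(ids == code, na.rm = TRUE) > 0),
  FUN.VALUE = integer(nrow(ids))
)
colnames(indicators) <- paste0("r.", codes)

result <- cbind(drug_df, indicators)
print(result)
```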
I have a data frame DF which looks like this:
ID Area time
1 1 182.685 1
2 2 182.714 1
3 3 182.275 1
4 4 211.928 1
5 5 218.804 1
6 6 183.445 1
...
1 1 184.334 2
2 2 196.765 2
3 3 186.435 2
4 4 213.322 2
5 5 214.766 2
6 6 172.667 2
... and so on up to ID = 6. I want to apply an autocorrelation function on each ID, i.e. compare ID = 1 at time 1 with ID = 1 at time 2 and so on.
What is the most straightforward way to apply e.g. acf() to my data frame?
When I try to use
autocorr = aggregate(x = DF$Area, by = list(DF$ID), FUN = acf)
I get a weird object.
Thanks in advance!
"I want to apply an autocorrelation function on each ID"
OK, good, so you don't want any cross-correlation, which makes things much easier.
"I get a weird object"
acf returns a bunch of things, i.e., it returns a list of things. I think you will be only interested in ACF values, so you need:
FUN = function (u) c(acf(u, plot = FALSE)$acf)
Also, using aggregate is not a good idea. You may want split and sapply:
## so your data frame is called `x`
oo <- sapply(split(x$Area, x$ID), FUN = function (u) c(acf(u, plot = FALSE)$acf) )
If you have balanced data, i.e., if you have equal number of observations for each ID, oo will be simplified into a matrix for sure. If you do not have balanced data, you may want to explicitly control the lag.max argument in acf. By default, acf will auto-decide on this value based on the number of observations.
Now suppose we want lag 0 to lag 7, we can set:
oo <- sapply(split(x$Area, x$ID),
FUN = function (u) c(acf(u, plot = FALSE, lag.max = 7)$acf) )
Thus the result oo is a matrix of 8 rows (one row per lag, one column per ID). I don't see any benefit in using a data frame to hold this result, but in case you want a data frame, simply do:
data.frame(oo)
With data either in a matrix or a data frame, it is easy for you to do further analysis.
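As a self-contained illustration of the split and sapply approach, here is a sketch on simulated data. The simulated values and the names DF and acf_by_id are assumptions standing in for the real data; only the acf call itself follows the code above.

```r
# Simulated stand-in for the data (assumption: 6 IDs, 20 time points each)
set.seed(1)
DF <- data.frame(
  ID   = rep(1:6, times = 20),
  time = rep(1:20, each = 6),
  Area = rnorm(120, mean = 190, sd = 15)
)
DF <- DF[order(DF$ID, DF$time), ]  # acf() assumes the values are in time order

acf_by_id <- sapply(
  split(DF$Area, DF$ID),
  function(u) c(acf(u, plot = FALSE, lag.max = 7)$acf)
)
rownames(acf_by_id) <- paste0("lag", 0:7)

print(dim(acf_by_id))        # 8 lags (0 to 7) by 6 IDs
print(round(acf_by_id, 2))
```

The first row is the lag-0 autocorrelation, which is 1 for every ID by definition.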
-----------
For a complete description of acf, please read Produce a boxplot for multiple ACFs
I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem. I have to get the values from the detector subsets into the result dataframe. Here it is important that each value finds its way to the right position in the dataframe. The issue is that the subsets do not contain equally many values, since some samples lack some detectors.
I tried to do the following: iterate through the detector subsets, compare the row name (= sample name) with the sample in the subset, and if they are the same write the value into the new dataframe. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
for (j in 1:length(get(detectors[i])$Sample)){
result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
}
}
The trouble is that this stops the iteration through the detector$Sample column and switches to the next detector. My understanding is that the compared samples get out of sync, so that all following ifelse calls yield NA.
I tried to circumvent this by putting j=j+1 in the no argument of ifelse(test, yes, no) to get it back in sync, but this unfortunately didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to generally improve my code ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
library(reshape2)
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
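If no extra package is wanted, the pre-allocated result table from the question can also be filled without any loop by indexing with a two-column matrix of (row, column) positions. This is a sketch of a different technique than acast or spread, using the example data from the question; match() looks up each sample and detector in the row and column labels, so values cannot get out of sync.

```r
DF <- data.frame(
  Sample   = c(1, 1, 2, 3, 3, 3),
  Detector = c("A", "B", "A", "A", "B", "C"),
  Value    = c(1, 2, 3, 2, 3, 1)
)

samples   <- sort(unique(DF$Sample))
detectors <- sort(unique(DF$Detector))

# Start with all NA, so combinations missing from DF simply stay NA
result <- matrix(NA_real_, nrow = length(samples), ncol = length(detectors),
                 dimnames = list(as.character(samples), detectors))

# One (row, column) pair per line of DF, then assign all values at once
pos <- cbind(match(DF$Sample, samples), match(DF$Detector, detectors))
result[pos] <- DF$Value

result <- as.data.frame(result)
print(result)
```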
I have two file.dat (random1.dat and random2.dat) which are generated from a random uniform distribution (changing the seed):
http://www.filedropper.com/random1_1: random1.dat
http://www.filedropper.com/random2 : random2.dat
I would like to use R to run a chi-squared test to check whether the two distributions are statistically the same.
To do that, I tried:
x1 <- read.table("random1.dat")
x2 <- read.table("random2.dat")
chisq.test(x1,x2)
but I receive an error message:
'x' and 'y' need to have the same length
Now the problem is that both files have 1000 rows, so I don't understand the error. Another question: if I want to automate this process (iterate it), for instance 100 times with 100 different files, can I do something like:
DO i=1,100
x1 <- read.table("random'(i)'.dat")
x2 <- read.table("fixedfile.dat")
chisq.test(x1,x2)
save results from the chisq analys
END DO
Thanks so much for Your help.
ADDED:
@eipi10,
I tried the first method you gave here, and it works well for the data you generate here. Then, when I try it on my data (I put a 2-column matrix of 1000 rows, two uniform distributions with different seeds, in a single file), something does not work correctly:
I load the file with: dat = read.table("random2col.dat");
I use the command: csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x))) and a warning message appears;
finally I use: unlist(lapply(csq, function(x) x$p.value)) BUT the output is something like:
[...] 1 1 1 1 1 1 1 1 1 1 1 1 1
[963] 1 1 1 1 1.....1 1 1 1
[1000] 1
I don't think you need to use a loop. You can use lapply instead. Also, you're entering x1 and x2 as separate columns of data. When you do this, chisq.test computes a contingency table from these two columns, which wouldn't be meaningful for columns of real numbers. Instead, you need to feed chisq.test a single matrix or data frame whose columns are x1 and x2. But even then, the chisq.test is expecting count data, which isn't what you have here (although the "expected" frequency doesn't necessarily have to be an integer). In any case, here's some code that will make the test run the way you seem to be hoping:
# Simulate data: 5 columns of data, each from the uniform distribution
dat = data.frame(replicate(5, runif(20)))
# Chi-Square test of each column against column 1.
# Note use of cbind to combine the two columns into a single matrix,
# rather than entering each column as separate arguments.
csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x)))
# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
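One caveat when the data has only two columns, as in the ADDED part of the question: dat[,-1] then drops to a plain vector, so lapply runs one test per row (1000 results) instead of one per column. A sketch of the difference, on simulated data standing in for random2col.dat; drop = FALSE keeps the data frame structure. As noted above, the test itself is still not meaningful for non-count data.

```r
# Simulated stand-in for the two-column file from the question
set.seed(1)
dat <- data.frame(V1 = runif(1000), V2 = runif(1000))

as_vector <- dat[, -1]                # single remaining column is dropped to a vector
as_frame  <- dat[, -1, drop = FALSE]  # stays a one-column data frame

print(length(as_vector))  # 1000: lapply would loop over every single value
print(length(as_frame))   # 1:    lapply loops over the one remaining column

# One test for the one remaining column (warning about the approximation suppressed)
csq <- lapply(as_frame, function(x) suppressWarnings(chisq.test(cbind(dat[, 1], x))))
p_values <- sapply(csq, function(x) x$p.value)
print(p_values)
```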
On the other hand, if you were intending your data to be two streams of values that would then be converted into a contingency table, here's an example of that:
# Simulate data of 5 factor variables, each with 10 different levels
dat = data.frame(replicate(5, sample(c(1:10), 1000, replace=TRUE)))
# Chi-Square test of each column against column 1. Here the two columns of data are
# entered as separate arguments, so that chisq.test will convert them to a two-way
# contingency table before doing the test.
csq = lapply(dat[,-1], function(x) chisq.test(dat[,1],x))
# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
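For the second part of the question (repeating the comparison over many files), lapply over a vector of file names can replace the DO loop. The sketch below makes several assumptions that are not in the original posts: it writes small simulated files to a temporary directory so it can run on its own, and it bins the continuous values into ten equal-width classes on [0, 1] so that chisq.test receives counts. The file names mirror those in the question.

```r
# Create demo files in a temporary directory (stand-ins for the real .dat files)
set.seed(42)
demo_dir <- tempfile("chisq_demo")
dir.create(demo_dir)
write_values <- function(name) {
  write.table(runif(1000), file.path(demo_dir, name),
              row.names = FALSE, col.names = FALSE)
}
write_values("fixedfile.dat")
file_names <- sprintf("random%d.dat", 1:5)
invisible(lapply(file_names, write_values))

# Assumption: bin the values into 10 classes so the test works on counts
breaks <- seq(0, 1, by = 0.1)
bin_counts <- function(path) {
  values <- read.table(path)[[1]]
  table(cut(values, breaks = breaks, include.lowest = TRUE))
}

fixed <- bin_counts(file.path(demo_dir, "fixedfile.dat"))

# One test per file: a 2 x 10 table of counts (fixed file vs. current file)
tests <- lapply(file.path(demo_dir, file_names),
                function(path) chisq.test(rbind(fixed, bin_counts(path))))
p_values <- sapply(tests, function(res) res$p.value)
names(p_values) <- file_names
print(p_values)

unlink(demo_dir, recursive = TRUE)  # remove the demo files again
```

The number and width of the bins influence the result, so they should be chosen with the real data in mind.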