I have a large data.table containing many time-dependent variables (50+) for use in coxph models. The dataset was generated with tmerge; patients are identified by the patid variable and time intervals are defined by tstart and tstop.
Most of the models I want to fit use only a selection of these time-dependent variables. Unfortunately, the speed of a Cox proportional hazards fit depends on the number of rows and the number of time points in my data.table, even if all the data in those rows are identical. Is there a good, fast way of combining rows that are identical apart from the time interval, in order to speed up my models? In many cases, after removing some columns, tstop for one row equals tstart for the next with everything else identical.
For example, I would want to convert the data.table example into results:
library(data.table)

example = data.table(patid = c(1, 1, 1, 2, 2, 2),
                     tstart = c(0, 1, 2, 0, 1, 2),
                     tstop = c(1, 2, 3, 1, 2, 3),
                     x = c(0, 0, 1, 1, 2, 2),
                     y = c(0, 0, 1, 2, 3, 3))
results = data.table(patid = c(1, 1, 2, 2),
                     tstart = c(0, 2, 0, 1),
                     tstop = c(2, 3, 1, 3),
                     x = c(0, 1, 1, 2),
                     y = c(0, 1, 2, 3))
This example is extremely simplified. My current dataset has ~600k patients, >20M rows and ~3.65k time points. Removing variables and collapsing the resulting duplicate rows should significantly reduce the row count, which should significantly speed up models fit using a subset of variables.
The best I can come up with is:
example = data.table(patid = c(1, 1, 1, 2, 2, 2),
                     tstart = c(0, 1, 2, 0, 1, 2),
                     tstop = c(1, 2, 3, 1, 2, 3),
                     x = c(0, 0, 1, 1, 2, 2),
                     y = c(0, 0, 1, 2, 3, 3))
example = example[order(patid, tstart)]
# flag rows whose covariates match those of the next row within the same patient
example[, matched := x == shift(x, -1) & y == shift(y, -1), by = "patid"]
example[is.na(matched), matched := FALSE]
# extend each matched row's tstop to the next row's tstop
example[, tstop := ifelse(matched, shift(tstop, -1), tstop)]
# drop the now-redundant following rows
example[, remove := tstop == shift(tstop), by = "patid"]
example = example[is.na(remove) | remove == FALSE]
example[, c("matched", "remove") := NULL]
This solves the example; however, the code is complex and feels like overkill, and with many columns in the dataset, having to edit x == shift(x, -1) for each variable invites errors. Is there a sane way of doing this? The list of columns will change several times inside loops, so a solution that accepts a vector of column names to compare would be ideal.
This approach also doesn't cope with runs of more than two consecutive time periods sharing the same covariate values (e.g. periods (0,1), (1,3), (3,4) with identical covariates), as the sketch below illustrates.
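A minimal illustration of that failure mode (hypothetical data): three consecutive intervals with identical covariates should collapse to (0, 4), but the pairwise shift() logic above leaves two overlapping rows instead.

bad = data.table(patid = 1, tstart = c(0, 1, 3), tstop = c(1, 3, 4), x = 0, y = 0)
# running the shift()-based steps above on `bad` keeps two overlapping
# intervals, (0, 3) and (1, 4), instead of the single interval (0, 4)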
This solution creates a temporary group ID based on rleid() of the combination of x and y. The temporary column is used for grouping and then dropped (temp := NULL):
example[, .(tstart = min(tstart), tstop = max(tstop), x = x[1], y = y[1]),
        by = .(patid, temp = rleid(paste(x, y, sep = "_")))][, temp := NULL][]
# patid tstart tstop x y
# 1: 1 0 2 0 0
# 2: 1 2 3 1 1
# 3: 2 0 1 1 2
# 4: 2 1 3 2 3
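A small variant: rleid() also accepts several vectors directly, which avoids building a pasted string key (same output, a minor sketch):

example[, .(tstart = min(tstart), tstop = max(tstop), x = x[1], y = y[1]),
        by = .(patid, temp = rleid(x, y))][, temp := NULL][]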
Here is an option that builds on our conversation/comments above, but adds the flexibility of passing a vector of column names:
cols = c("x", "y")
cbind(
  example[, id := rleidv(.SD), .SDcols = cols][, .(tstart = min(tstart), tstop = max(tstop)), by = .(patid, id)],
  example[, .SD[1], by = .(patid, id), .SDcols = cols][, ..cols]
)[, id := NULL][]
Output:
patid tstart tstop x y
1: 1 0 2 0 0
2: 1 2 3 1 1
3: 2 0 1 1 2
4: 2 1 3 2 3
Based on Wimpel's answer, I have created the following solution, which also accepts a vector of column names as input.
example = data.table(patid = c(1, 1, 1, 2, 2, 2),
                     tstart = c(0, 1, 2, 0, 1, 2),
                     tstop = c(1, 2, 3, 1, 2, 3),
                     x = c(0, 0, 1, 1, 2, 2),
                     y = c(0, 0, 1, 2, 3, 3))
variables = c("x", "y")

# build a single key from the selected columns, then collapse each run of
# identical keys within a patient to one interval
example[, key_ := do.call(paste, c(.SD, sep = "_")), .SDcols = variables]
example[, c("tstart", "tstop") := .(min(tstart), max(tstop)),
        by = .(patid, temp = rleid(key_))][, key_ := NULL]
example = unique(example)
I would imagine this could be simplified, but I think it does what is needed for more complex examples.
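A quick check (hypothetical data) that runs longer than two intervals, which the shift()-based attempt above could not handle, collapse correctly:

check = data.table(patid = 1, tstart = c(0, 1, 3), tstop = c(1, 3, 4), x = 0, y = 0)
check[, key_ := do.call(paste, c(.SD, sep = "_")), .SDcols = variables]
check[, c("tstart", "tstop") := .(min(tstart), max(tstop)),
      by = .(patid, temp = rleid(key_))][, key_ := NULL]
unique(check)
#    patid tstart tstop x y
# 1:     1      0     4 0 0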
EDIT: There were several problems in my first example, so I am reworking it here. This is primarily to credit the original responder, who cut my processing time by a factor of about 180 even with my poor example. The question was closed as unclear or not general enough, but I think it has value: data.table can do amazing things with the right syntax, but that syntax can be elusive even with the available vignettes. In my experience, having more examples of how data.table can be used is helpful, particularly for those of us who got our start in Excel; the VLOOKUP-like behavior here fills a gap that is not always easy to find.
The specific things that happen in this example that may be of general interest are:
looking up values in one data.table in another data.table
passing variables by name and by reference
apply like behavior in data.table
Original question with modified (limited rows) example:
I am looking for help in the arcane world of data.table, passing functions, and fast lookups across multiple tables. I have a larger function that, when profiled, seems to spend all of its time in this one area doing some fairly straightforward lookup and sum actions. I am not adept enough at profiling to figure out exactly which subareas of the call cause the problem, but my guess is that I am unintentionally doing something computationally expensive that I don't need to do. data.table syntax is still a complete mystery to me, so I am seeking help here to speed this process up.
Small worked example:
library(data.table)
set.seed(seed = 911)
## Other parts of the analysis generate all of these data.tables

# a data.table containing id values (the real version has other things too)
whoamI <- data.table(id = 1:5)

# the result of another calculation: it tells me how many neighbors I will be
# interested in (the real version has many more columns)
howmanyneighbors <- data.table(id = 1:5, toCount = round(runif(5, min = 1, max = 3), 0))

# who the first three neighbors are for each id
# (the real version has hundreds of neighbors)
myneighborsare <- data.table(id = 1:5, matrix(1:5, ncol = 3, nrow = 5, byrow = TRUE))
colnames(myneighborsare) <- c("id", "N1", "N2", "N3")

# how many of each group live at each location?
groupPops <- data.table(id = 1:5, matrix(floor(runif(25, min = 0, max = 10)), ncol = 5, nrow = 5))
colnames(groupPops) <- c("id", "ape", "bat", "cat", "dog", "eel")
whoamI
howmanyneighbors
myneighborsare
groupPops
> whoamI
id
1: 1
2: 2
3: 3
4: 4
5: 5
> howmanyneighbors
id toCount
1: 1 2
2: 2 1
3: 3 3
4: 4 3
5: 5 2
> myneighborsare
id N1 N2 N3
1: 1 1 2 3
2: 2 4 5 1
3: 3 2 3 4
4: 4 5 1 2
5: 5 3 4 5
> groupPops
id ape bat cat dog eel
1: 1 9 8 6 8 1
2: 2 9 8 0 9 8
3: 3 6 1 9 1 2
4: 4 6 1 9 0 3
5: 5 6 2 2 2 5
## At any given time I will only want the group populations for some of the groups.
# I will always want 'ape', but other groups will vary; here I have picked two.
# I retain this because passing the column names by variable along with 'ape' was
# tricky, and I don't want to lose that syntax in any new answer.
animals <- c("bat", "eel")
# similarly, howmanyneighbors has many more columns, and I need to pass a
# reference to one of them, which I call i here
i <- 2

## Functions I will call on the above data

# get the ids of my neighbors from myneighborsare; the number of ids returned
# varies with the value in howmanyneighbors
getIDs <- function(a) {
  myneighborsare[id == a, 2:(as.numeric(howmanyneighbors[id == a, ..i]) + 1)]
} # so many coding fails here it pains me to put this in public view

# sum the populations of my neighbors for the groups I am interested in
sumVals <- function(b) {
  colSums(groupPops[id %in% b, c("ape", ..animals)])
} # cringe

# wrap the first two together and put the result into a format that works well
# as a row returned in a data.table
doBoth <- function(a) {
  ro.ws <- getIDs(a)
  su.ms <- sumVals(ro.ws)
  # not too worried about this step; it mimics the original code at little time cost
  answer <- lapply(split(su.ms, names(su.ms)), unname)
  return(answer)
}

# run the above function on my data
result <- data.table(whoamI)
result[, doBoth(id), by = id]
id ape bat eel
1: 1 18 16 9
2: 2 6 1 3
3: 3 21 10 13
4: 4 24 18 14
5: 5 12 2 5
This involves a reshape and non-equi join.
library(data.table)
# reshape to long and add a grouping ID for a non-equi join later
molten_neighbors <- melt(myneighborsare, id.vars = 'id')[, grp_id := .GRP, by = variable]
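# molten_neighbors has columns id, variable, value and grp_id, where grp_id is
# 1 for N1, 2 for N2 and 3 for N3, so the non-equi join below keeps only the
# first `toCount` neighbors of each id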
#regular join by id
whoamI[howmanyneighbors,
on = .(id)
#non-equi join - replaces getIDs(a)
][molten_neighbors,
on = .(id, toCount >= grp_id),
nomatch = 0L
#regular join - next steps replace sumVals(ro.ws)
][groupPops[, c('id','ape', ..animals)],
on = .(value = id),
.(id, ape, bat, eel),
  nomatch = 0L
][,
lapply(.SD, sum),
keyby = id
]
I highly recommend simplifying future questions. Using 10 rows allows you to post the tables within your question. As is, it was somewhat difficult to follow.
I'm really sorry to ask this dumb question, but I don't get what is going wrong.
I have a dataset which I convert into a data.table object:
library(data.table)

# generate 100,000 ids, each associated with a group, in a data set called base
id = c(1:100000)
group = sample(c(1:5), 100000, TRUE)
base = cbind(id, group)
base = as.data.table(base)
I make a basic group-by computation to get the number of rows by group, yet the resulting table still contains the same number of rows:
counting = base[, COUNT := .N, by = group]
nrow(counting)
# 100000
What did I miss? Is there an option in data.table that addresses my problem?
Taking akrun's comment, I decided to provide an answer. It seems that you were not sure how to summarise your data and got confused. First, one point about constructing a data set:
set.seed(123)
id = c(1:100000)
group = sample(c(1:5),100000,TRUE)
base = data.frame(id,group)
setDT(base)
base
id group
1: 1 2
2: 2 4
3: 3 3
4: 4 5
5: 5 5
....
When you use cbind() on multiple vectors, they are coerced to a common class to make a matrix. The safer way to go is data.frame(), which allows mixed column classes. And if you have a data.frame, you can turn it into a data.table by reference with setDT(), without needing to assign the result.
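A quick illustration of that coercion (a minimal sketch):

# with mixed inputs, cbind() produces a character matrix
m <- cbind(id = 1:3, group = c("a", "b", "c"))
typeof(m)  # "character" -- the numeric ids have become strings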
Adding a new column: your code was essentially adding a new column to the data.table object. When you use :=, you are doing the equivalent of mutate() in dplyr or transform() in base R, with one important difference: with :=, the column is added to the data.table by reference, so there is no need to assign the result.
base[, COUNT := .N, by = group]
base
id group COUNT
1: 1 2 20099
2: 2 4 19934
3: 3 3 20001
4: 4 5 19933
5: 5 5 19933
...
Here, you are counting how many data points exist for each group and assigning that value to all rows. For instance, the total count of group 2 is 20099, and this number is given to every row with group == 2. You are creating a new column, not summarising the data; hence, you still have 100,000 rows. There is currently no function that modifies the number of rows of a data.table by reference.
Summarising the data. If you want to count how many data points exist for each group and summarize the data, you want the following.
dt2 <- base[, .(COUNT = .N), by = group]
dt2
group COUNT
1: 2 20099
2: 4 19934
3: 3 20001
4: 5 19933
5: 1 20033
dim(dt2)
[1] 5 2
Here, make sure that you use =, not :=, since you are summarising the data. This time it is necessary to assign the result, because we are creating a new data.table. I hope this clears things up.
Have you noticed that

base$regroup = group
base[, .(Count = .N, regroup), by = group]

gives 100,000 rows, even though group and regroup are identical?
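A brief explanation (my reading of data.table's semantics): because regroup is returned bare in j, each group contributes length(regroup) rows and the scalar .N is recycled to match, so no aggregation happens; j collapses to one row per group only when every expression it returns has length one.

# one row per original observation, with Count recycled within each group
nrow(base[, .(Count = .N, regroup), by = group])  # 100000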
I have a data table which could be reduced to this:
library(data.table)

set.seed(1)
dt <- data.table(form = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 5),
                 mx   = c("a", "b", "c", "d", "e", "f", "g", "e", "g", "b"),
                 vr   = runif(10, 100, 200),
                 usr  = c("l", "l", "l", "m", "o", "o", "o", "l", "l", "m"),
                 type = c("A", "A", "A", "C", "C", "C", "C", "C", "C", "A"))
I can generate a table with:
dt[, list(n.form = length(unique(form)), n.mx = length(unique(mx)), tot.vr = sum(vr)),
   by = usr]
What I haven't been able to do is count the number of formulas of type A (each row is an observation; form is the formula number). I've tried:
dt[, list(n.form = length(unique(form)), n.mx = length(unique(mx)), tot.vr = sum(vr),
          n.A = sum(type == "A")),
   by = usr]
and also:
dt[, list(n.form = length(unique(form)), n.mx = length(unique(mx)), tot.vr = sum(vr),
          n.A = length(unique(type == "A"))),
   by = usr]
but neither of those takes into account that the number of "A"s found needs to be related to the unique formula (form) numbers.
What I'd like to have as a result is:
usr n.form n.mx tot.vr n.A
1: l 2 5 750.0398 1
2: m 2 2 296.9994 1
3: o 1 3 504.4747 0
but I can't find a way to achieve it. Any light shed is much appreciated.
Thanks,
======= EDIT TO ADD ========
I want to know how many of the formulas (unique numbers in dt$form) are of type "A", so I can calculate a proportion out of the total number of formulas. The direct sum gives the total number of observations of type A, while any() only tells me whether there was at least one formula of type "A", not the number of formulas of that type (which is what I want). Please note that any given formula will always be either of type "A" or "C" (no mixed types within one formula).
In the devel version of data.table, you can use uniqueN() instead of length(unique(...)):
library(data.table) # v1.9.5+
# uniqueN(form[type == 'A']) first subsets the formula numbers to the type 'A'
# rows within each usr group, then counts the distinct values
dt[, list(n.form = uniqueN(form), n.mx = uniqueN(mx), tot.vr = sum(vr),
          n.A = uniqueN(form[type == 'A'])), by = usr]
# usr n.form n.mx tot.vr n.A
#1: l 2 5 750.0398 1
#2: m 2 2 296.9994 1
#3: o 1 3 504.4747 0
I have two vectors (A and B) with categorical data on 36 subjects: A[i] = j is the type-1 category that subject i fits into, and B[i] = k is the type-2 category of subject i, with i = 1:36, j = 1:5 and k = 1:6.
library(mlogit)
AB <- read.csv("C:/.../AB.csv")
head(AB)
Subject A B
1 1 1 3
2 2 3 3
3 3 1 6
4 4 1 3
5 5 1 2
6 6 1 4
I would like to find a probability for every category combination: with what probability does a subject choose categories j and k, for all j = 1:5 and k = 1:6?
I was told the probit/logit model was a great tool to use for this problem and I tried estimating it in R.
mldata <- mlogit.data(AB, choice = "A", alt.var = "B", shape = "long", id.var = "Subject")
This gives me an error, and I cannot find my mistake:
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.3", "1.3", "1.6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1.3’, ‘2.2’, ‘2.3’, ‘3.1’,‘3.5’,‘4.2’,‘4.3’, ‘5.3’, ‘5.4’, ‘6.5’, ‘7.3’, ‘8.2’, ‘8.3’
I tried looking through the help files, but that has not helped me much.
I hope someone can point out the mistake(s) I'm making.
Thank you very much for your help.
Post the output of dput(A) and dput(B) and specify what the first couple of answers should be. It looks like you want rowSums(.)/6 across some logical operation on those two matrices. Probably:
rowSums(A==B)/6
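If the goal is instead a probability for each (j, k) category combination, a plain cross-tabulation of the two columns may already be enough (a minimal sketch, assuming the AB data shown above):

# empirical joint distribution over the 5 x 6 category combinations
prop.table(table(AB$A, AB$B))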