I am using data.table in R and looping over my table; it is really slow because of the size of my table.
I wonder if anyone has an idea on how to speed it up.
I have a set of values that I want to "cluster".
Each row has a position, a positive integer. You can load a simple example like this:
library(data.table)
#Here is a toy example
fulltable=c(seq (1,4))*c(seq(1,1000,10))
fulltable=data.table(pos=fulltable[order(fulltable)])
fulltable$id=1
So I loop over the rows, and when there is a gap of more than 50 between two consecutive positions I start a new group:
#here is the main loop
lastposition=fulltable[1]$pos
lastid=fulltable[1]$id
for(i in 2:nrow(fulltable)){
if(fulltable[i]$pos-50>lastposition){
lastid=lastid+1
print(lastid)
}
fulltable[i]$id=lastid;
lastposition=fulltable[i]$pos
}
Any idea for an efficient way to do this?
fulltable[which((c(fulltable$pos[-1], NA) - fulltable$pos) > 50) + 1, new_group := 2:(.N+1)]
fulltable[is.na(new_group), new_group := 1]
fulltable[, c("lastid_new", "new_group") := list(cummax(new_group), NULL)]
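The same grouping can also be sketched with a cumulative sum over the gaps between consecutive positions. This is a minimal alternative, not the code from the answer above; it assumes the table is already sorted by pos as in the toy example, and the column name id_fast is made up here for illustration.

```r
library(data.table)

# Toy data from the question, sorted by position
pos <- c(seq(1, 4)) * c(seq(1, 1000, 10))
fulltable <- data.table(pos = sort(pos))

# diff(pos) > 50 is TRUE wherever the gap to the previous row exceeds 50.
# The cumulative sum of those flags, plus 1, numbers the groups from 1.
fulltable[, id_fast := cumsum(c(FALSE, diff(pos) > 50)) + 1L]
```

Because everything is vectorised, there is no per-row subsetting of the data.table, which is where the original loop spends most of its time.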
I need to work out a fast way of conditionally finding the difference, in days, between two dates in a data table. I managed to do it with ifelse(), but it is slow on big objects. Is there a faster, more elegant way of achieving the same thing, perhaps using data.table commands like ":="? Thx. J.
library(lubridate)
library(data.table)
rm(list = ls())
a <- as.Date(c ("2021-09-27", "2019-10-30", "2021-09-05"))
b <- as.Date(c ("2020-06-14", "2019-09-15", "2020-09-23"))
c <- as.Date(c ("2022-07-12", "2020-09-23", "2021-06-19"))
new <- data.table(leave = a, start = b, end = c)
new$days <- ifelse (
new$leave < new$end,
new$leave - new$start,
new$end - new$start)
So in words, when leaving date < end of period, subtract leave from start, however if leave >= end then subtract start from end, and give result back in a new column in days.
Using the pmin() function and data.table's assign operator :=
new[, days := as.numeric(pmin(leave, end)-start)]
Or you could assign it to all rows one way then chain off a subset:
new[, days := as.numeric(end - start)][leave < end, days := as.numeric(leave - start)]
Or take advantage of by and the .GRP special symbol (this relies on both groups being present, so that .GRP is 1 for leave < end and 2 for leave >= end):
new[, days := list(leave-start, end-start)[.GRP], keyby=.(leave>=end)]
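As a quick sanity check, the pmin() version can be compared against the original ifelse() on the sample dates. This is only a sketch; the column names days_ifelse and days_pmin are introduced here for the comparison and are not part of the question.

```r
library(data.table)

new <- data.table(
  leave = as.Date(c("2021-09-27", "2019-10-30", "2021-09-05")),
  start = as.Date(c("2020-06-14", "2019-09-15", "2020-09-23")),
  end   = as.Date(c("2022-07-12", "2020-09-23", "2021-06-19"))
)

# Original approach: pick one of two differences row by row
new[, days_ifelse := ifelse(leave < end, leave - start, end - start)]

# pmin() approach: take the earlier of leave/end first, then subtract once
new[, days_pmin := as.numeric(pmin(leave, end) - start)]

same <- isTRUE(all.equal(new$days_ifelse, new$days_pmin))
```

The pmin() form does a single subtraction per row instead of computing both differences and then choosing between them.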
When using a list column of data.tables in a nested data.table it is easy to apply a function over the column. Example:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
We can use:
dt[ ,list(length = nrow(dt.mtcars[[1]])), by = gear]
gear length
1: 4 12
2: 3 15
3: 5 5
or
dt[, list( length = lapply(dt.mtcars, nrow)), by = gear]
gear length
1: 4 12
2: 3 15
3: 5 5
I would like to do the same process and apply a modification by reference using the operator := to each data.table of the column.
Example:
modify_by_ref<- function(d){
d[, max_hp:= max(hp)]
}
dt[, modify_by_ref(dt.mtcars[[1]]), by = gear]
That returns the error:
Error in `[.data.table`(d, , `:=`(max_hp, max(hp))) :
.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Following the tip in the error message does not work for me in any way; it seems to target a different case, but maybe I am missing something. Is there any recommended way or flexible workaround to modify list columns by reference?
This can be done in the following two steps, or in a single step:
The given table is:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Step 1 - Let's add list of column hp vectors in each row of dt
dt[, hp_vector := .(list(dt.mtcars[[1]][, hp])), by = list(gear)]
Step 2 - Now calculate the max of hp
dt[, max_hp := max(hp_vector[[1]]), by = list(gear)]
The given table is:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Single Step - Single step is actually the combination of both of the above steps:
dt[, max_hp := .(list(max(dt.mtcars[[1]][, hp])[[1]])), by = list(gear)]
If we wish to populate values within the nested table by reference, the following link explains how to do it; we just need to ignore a warning message. I would be happy if anyone could point out how to fix the warning or whether there is any pitfall. For more detail, please refer to the link:
https://stackoverflow.com/questions/48306010/how-can-i-do-fast-advance-data-manipulation-in-nested-data-table-data-table-wi/48412406#48412406
Taking inspiration from it, I am going to show how to do it here for the given data set.
Let's first clean everything:
rm(list = ls())
Let's re-define the given table in different way:
dt<- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]
Note that I have defined the table slightly differently: I have wrapped .SD in data.table() in addition to list() in the above definition.
Next, populate the max by reference within nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]
And, as a bonus, we can then perform further manipulation within the nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, weighted_hp_carb := max_hp*carb])), by = list(gear)]
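If strict by-reference modification is not required, a copy-based variant may sidestep both the locked .SD error and the warning. The sketch below swaps in a different technique from the one shown above: it copies each nested table, adds the column to the copy, and replaces the whole list column in one assignment. The helper name add_max_hp is hypothetical.

```r
library(data.table)

# Nested table built as in the answer above (each group wrapped in its own data.table)
dt <- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = gear]

# Hypothetical helper: work on a copy so the stored table is never modified in place
add_max_hp <- function(d) {
  d <- copy(d)
  d[, max_hp := max(hp)]
  d
}

# No `by` here: the list column is processed as a whole and then replaced
dt[, dt.mtcars := list(lapply(dt.mtcars, add_max_hp))]
```

The cost is one copy per nested table, so this trades memory for simplicity; for very large nested tables the by-reference route may still be preferable.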
I want to subtract a column called df$Means from each column in R. I want to do this as a function, but I'm not sure how to iterate through each of the columns: each iteration relies on df$Means being subtracted from one column, and then there is a load of downstream code that uses the output. I have simplified the code here, as this is the bit that's giving me trouble. So far I have:
CopyNumberLoop <- function (i) {df$ZScore <- (df[3:5]-df$Means)/(df$sd)
}
apply(df[3:50], 2, CopyNumberLoop)
but I'm not sure how to make sure that the operation is done on one column at a time. I don't think df[3:5] is correct?
I have been asked to produce a reproducible example so all the code I want is here:
df <- read.delim(file.choose(),header=TRUE)
#Take the control samples and average each row for three columns excluding the first two columns- add the per row means to the data frame
df$Means <- rowMeans(df[,30:32])
RowVar <- function(x) {rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)}
df$sd=sqrt(RowVar(df[,c(30:32)]))
#Get a Z score by dividing the test sample count at each locus by the average for the control samples and divide everything by the st dev for controls at each locus.
{
df$ZScore <- (df[,35]-df$Means)/(df$sd)
######################################### QUARTILE FILTER ###########################################################
alpha=1.5
numberofControls = 3
UL = median(df$ZScore, na.rm = TRUE) + alpha*IQR(df$ZScore, na.rm = TRUE)
LL = median(df$ZScore, na.rm = TRUE) - alpha*IQR(df$ZScore, na.rm = TRUE)
#Copy the Z score if the score is > or < a certain number, i.e. LL or UL.
Zoutliers <- which(df$ZScore > UL | df$ZScore < LL)
df$Zoutliers <- ifelse(df$ZScore > UL |df$ZScore <LL ,1,-1)
tempout = ifelse(df$ZScore[Zoutliers] > UL,1,-1)
######################################### Three neighbour Isolation filter ##############################################################################
finalSeb=c()
for(i in 2:(length(Zoutliers)-1)){
j=Zoutliers[i]
if(sum(ifelse((j-1) == Zoutliers,1,0)) > 0 & tempout[i] == tempout[i-1] & sum(ifelse((j+1) == Zoutliers,1,0)) > 0 & tempout[i] == tempout[i+1]){
finalSeb = c(finalSeb,i)
}
}
finalset_row_number = Zoutliers[finalSeb]
#View(finalset_row_number)
p_seq = rep(0,nrow(df))
for(i in 1:length(finalset_row_number)){
p_seq[(finalset_row_number[i]-1):(finalset_row_number[i]+1)] = median(df$ZScore[(finalset_row_number[i]-1):(finalset_row_number[i]+1)])
}
nrow(as.data.frame(finalset_row_number))
}
For each column between 3 and 50, I'd like to generate nrow(as.data.frame(finalset_row_number)) and keep it in another data frame. Admittedly my code is a mess, because I don't know how to create a function that will let me apply this to each column.
Your code isn’t using the parameter i at all. In fact, i is the current column, so that’s what you should use:
result = apply(df[, 3 : 50], 2, function (col) col - df$Means)
Or you can subtract the means directly:
result = df[, 3 : 50] - df$Means
This returns a new data frame (or, for the apply version, a matrix) consisting of columns 3–50 of df, with df$Means subtracted from each in turn. Or, if you want to calculate Z scores as your code seems to do:
result = (df[, 3 : 50] - df$Means) / df$sd
It appeared that you wanted the Z-scores assigned back into the original dataframe as named columns. If you want to loop over columns, it would be just as economical to use lapply or sapply. The receiving function will accept each column in turn and match it to the first parameter. Any other arguments offered after the receiving function will get matched by name or position to any other symbol/names in the parameter list. You do not do any assignment to 'df' inside the function:
CopyNumberLoop <- function (col) { (col - df$Means) / df$sd
}
df[, paste0('ZScore' , 3:50)] <- # assignment done outside the loop
lapply(df[3:50], CopyNumberLoop) # result is a list
# but the `[<-.data.frame` method will accept a list.
Using apply coerces to a matrix, which may have undesirable effects if a column is not numeric (say factor or date-time). It's better to get into the habit of using lapply when working on ranges of columns in data frames.
If you want to assign the result of this operation to a new dataframe, then the lapply(.) result would need to be wrapped in as.data.frame and then column names could be assigned. Same effort would need to be done to a result from apply(.).
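To collect one number per column in a separate data frame, the per-column work can be wrapped in a function and applied with vapply. The sketch below uses made-up toy data and column names (ctrl1..ctrl3, test1, test2), and it only counts Z scores outside median ± alpha * IQR; it does not reproduce the three-neighbour isolation filter from the question.

```r
set.seed(1)

# Toy data: three control columns and two test columns (names are illustrative)
df <- data.frame(
  ctrl1 = rnorm(20, 100, 5),
  ctrl2 = rnorm(20, 100, 5),
  ctrl3 = rnorm(20, 100, 5),
  test1 = rnorm(20, 100, 15),
  test2 = rnorm(20, 100, 15)
)
ctrl_cols <- c("ctrl1", "ctrl2", "ctrl3")
test_cols <- c("test1", "test2")

# Per-row mean and standard deviation of the controls
df$Means <- rowMeans(df[, ctrl_cols])
df$sd <- apply(df[, ctrl_cols], 1, sd)

# One Z-score vector per test column; lapply hands over one column at a time
z_scores <- lapply(df[test_cols], function(col) (col - df$Means) / df$sd)

# Count values outside median +/- alpha * IQR for a single Z-score vector
count_outliers <- function(z, alpha = 1.5) {
  ul <- median(z, na.rm = TRUE) + alpha * IQR(z, na.rm = TRUE)
  ll <- median(z, na.rm = TRUE) - alpha * IQR(z, na.rm = TRUE)
  sum(z > ul | z < ll, na.rm = TRUE)
}

# One row per test column in a separate summary data frame
summary_df <- data.frame(
  column = test_cols,
  n_outliers = vapply(z_scores, count_outliers, integer(1))
)
```

The rest of the downstream code could go inside count_outliers in the same way, as long as it takes a single Z-score vector and returns a single value.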
I am trying to use data.table to recode a variable based on certain conditions. My original dataset has around 30M records and after all variable creation around 130 variables. I used the methods suggested here: conditional statements in data.table (M1) and also here data.table: Proper way to do create a conditional variable when column names are not known? (M2)
My goal is to get the equivalent of the code below, but in a form that works with data.table:
samp$lf5 <- samp$loadfactor5
samp$lf5 <- with(samp, ifelse(loadfactor5 < 0, 0, lf5))
I will admit that I don't understand .SD and .SDCols very well, so I might be using it wrong. The code and errors from (M1) and (M2) are given below and the sample dataset is here: http://goo.gl/Jp97Wn
(M1)
samp[,lf5 = if(loadfactor5 <0) 0 else loadfactor5]
Error Message
Error in `[.data.table`(samp, , lf5 = if (loadfactor5 < 0) 0 else loadfactor5) :
unused argument (lf5 = if (loadfactor5 < 0) 0 else loadfactor5)
When I do this:
samp[,list(lf5 = if(loadfactor5 <0) 0 else loadfactor5)]
it gives lf5 as a list but not as part of the samp data.table and does not really apply the condition as lf5 still has values less than 0.
(M2)
Col1 <- "loadfactor5"
Col2 <- "lf5"
setkeyv(samp,Col1)
samp[,(Col2) :=.SD,.SDCols = Col1][Col1<0,(Col2) := .SD, .SDcols = 0]
I get the following error
Error in `[.data.table`(samp, , `:=`((Col2), .SD), .SDCols = Col1) :
unused argument (.SDCols = Col1)
Any insights on how to finish this appreciated. My dataset has 30M records so I am hoping to use data.table to really cut the run time down.
Thanks,
Krishnan
Answer provided by eddi and included here for the sake of completeness.
samp[, lf5 := ifelse(loadfactor5 < 0, 0, loadfactor5)]
Another way (which I prefer because it's, in my opinion, cleaner):
samp[, lf5 := 0]; samp[loadfactor5 > 0, lf5 := loadfactor5];
I use data.table with a dataset with 90M rows; I am continually amazed at how fast data.table is for operations like the above.
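Since the recode simply floors the values at zero, base R's pmax() is another option, and data.table's fifelse() (available in data.table 1.12.4 and later) is a faster drop-in for ifelse(). A minimal sketch on made-up values; the column lf5_fifelse exists only for the comparison:

```r
library(data.table)

# Toy stand-in for the real data (values are made up)
samp <- data.table(loadfactor5 = c(-2.5, 0, 1.2, NA, 3.4))

# Floor at zero: negative values become 0, everything else is kept
samp[, lf5 := pmax(loadfactor5, 0)]

# Same result with fifelse(); yes and no must have the same type
samp[, lf5_fifelse := fifelse(loadfactor5 < 0, 0, loadfactor5)]
```

Both versions keep NA as NA, which matches what ifelse() does in the original code.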
I am working on a function that takes a list of data tables with the same column names as an input and returns a single data table that has the unique rows from each data frame combined using successive rbind as shown below.
The function would be applied to a "very" large data.table (tens of millions of rows), which is why I had to split it up into several smaller data tables and put them into a list to use recursion. At each step, depending on the length of the list of data tables (odd or even), I find the unique rows of the data.table at list index x and of the data table at list index x - 1, then rbind the two, assign the result to list index x - 1, and remove list index x.
I must be missing something obvious, because although I can produce the final unique-d data.table when I print it (eg., print (listelement[[1]]), when I return (listelement[[1]]) I get NULL. Would help if someone can spot what I am missing ... or suggest if there is perhaps any other more efficient way to perform this.
Also, instead of having to add each data.table to a list, can I add them as "references" in the list ? I believe doing something like list(datatable1, datatable2 ...) would actually copy them ?
## CODE
returnUnique2 <- function (alist) {
if (length(alist) == 1) {
z <- (alist[[1]])
print (class(z))
print (z) ### This is the issue, if I change to return (z), I get NULL (?)
}
if (length(alist) %% 2 == 0) {
alist[[length(alist) - 1]] <- unique(rbind(unique(alist[[length(alist)]]), unique(alist[[length(alist) - 1]])))
alist[[length(alist)]] <- NULL
returnUnique2(alist)
}
if (length(alist) %% 2 == 1 && length(alist) > 2) {
alist[[length(alist) - 1]] <- unique(rbind(unique(alist[[length(alist)]]), unique(alist[[length(alist) - 1]])))
alist[[length(alist)]] <- NULL
returnUnique2(alist)
}
}
## OUTPUT with print statement
t1 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
t2 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
t3 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
tempList <- list(t1, t2, t3)
returnUnique2(tempList)
[1] "list"
[[1]]
col1 col2
1: a 3
2: a 2
3: a 5
4: a 9
5: a 10
6: a 7
7: a 1
8: a 8
9: a 4
10: a 6
Changing the following,
print (z) ### This is the issue, if I change to return (z), I get NULL (?)
to read
return(z)
returns NULL
Thanks in advance.
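One likely explanation for the NULL: an R function returns the value of its last evaluated expression. In the even-length branch the recursive call is followed by another if whose condition is FALSE, and an if without else that does not run evaluates to NULL, so the result of the recursion is discarded. A minimal sketch of a version that returns at every level, assuming a non-empty list; the name returnUnique3 is made up here:

```r
library(data.table)

returnUnique3 <- function(alist) {
  n <- length(alist)
  # Base case: return() ends the function with the de-duplicated table
  if (n == 1) return(unique(alist[[1]]))
  # Fold the last table into the one before it, then drop the last slot
  alist[[n - 1]] <- unique(rbind(alist[[n - 1]], alist[[n]]))
  alist[[n]] <- NULL
  # The recursive call is the last expression, so its value is passed back up
  returnUnique3(alist)
}

set.seed(1)
t1 <- data.table(col1 = rep("a", 10), col2 = round(runif(10, 1, 10)))
t2 <- data.table(col1 = rep("a", 10), col2 = round(runif(10, 1, 10)))
t3 <- data.table(col1 = rep("a", 10), col2 = round(runif(10, 1, 10)))

res <- returnUnique3(list(t1, t2, t3))
```

The odd and even cases do the same work in the original code, so a single branch is enough here.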
Please correct me if I misunderstand what you're doing, but it sounds like you have one big data.table and are trying to split it up to run some function on it and would then combine everything back and run a unique on that. The data.table way of doing that would be to use by, e.g.
fn = function(d) {
# do whatever to the subset and return the resulting data.table
# in this case, do nothing
d
}
N = 10 # number of pieces you like
dt[, fn(.SD), by = (seq_len(nrow(dt)) - 1) %/% (nrow(dt)/N)][, seq_len := NULL]
dt = dt[!duplicated(dt)]
Seems like this could be a good use case for a for loop. With many rows the overhead of using a for loop should be relatively small compared to the computation time. I would try combining my data.table's into a list (called ll in my example), then for each one remove duplicated rows, then rbind to the previous data.table with unique rows and then subset by unique rows again.
If you have many duplicated rows in each chunk then this might save some time, overall I'm not sure how effective it will be, but worth a shot?
# Create empty data.table for results (I have columns x and y in this case)
res <- data.table( x= numeric(0),y=numeric(0))
# loop over all data.tables in a list called 'll'
for( i in 1:length(ll) ){
# rbind the unique rows from the current list element to the results from all previous iterations
res <- rbind( res , ll[[i]][ ! duplicated(ll[[i]]) , ] )
# Keep only unique records at each iteration
res <- res[ ! duplicated(res) , ]
}
On another note, have you looked at the documentation for data.table? It explicitly states,
Because data.tables are usually sorted by key, tests for duplication
are especially quick.
So you might just be better off running on the entire data.table?
DT[ ! duplicated(DT) , ]
Add an id column to each data.table
t1$id=1
t2$id=2
t3$id=3
then combine them all at once and do a unique using by=.
If the data.tables are huge you could use setkey(...) to create an index on id before calling unique.
tall=rbind(t1,t2,t3)
tall[, unique(.SD), by = id]
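If the goal is simply the unique rows across all tables, a compact alternative is to bind the list once with rbindlist() and call unique() on the result. This is a sketch under the assumption that all tables share the same columns and that the combined table fits in memory:

```r
library(data.table)

set.seed(42)
t1 <- data.table(col1 = rep("a", 10), col2 = round(runif(10, 1, 10)))
t2 <- data.table(col1 = rep("a", 10), col2 = round(runif(10, 1, 10)))
t3 <- data.table(col1 = rep("a", 10), col2 = round(runif(10, 1, 10)))

# Bind all tables in one pass, then drop duplicated rows across the whole result
combined <- unique(rbindlist(list(t1, t2, t3)))
```

rbindlist() binds the whole list in a single call rather than growing the result one table at a time, which avoids the repeated rbind of the recursive and loop-based versions.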