data.table size and datatable.alloccol option

The dataset I am working on is not very big, but it is quite wide. It currently has 10,854 columns and I would like to add roughly another 10-11k columns. It has only 760 rows.
When I try to do so (applying functions to a subset of the existing columns), I get the following warning:
Warning message:
In `[.data.table`(setDT(Final), , `:=`(c(paste0(vars, ".xy_diff"), :
truelength (30854) is greater than 10,000 items over-allocated (length = 10854). See ?truelength. If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo().
I have tried to play with setalloccol, but I get something similar. For example:
setalloccol(Final, 40960)
Error in `[.data.table`(x, i, , ) :
getOption('datatable.alloccol') should be a number, by default 1024. But its type is 'language'.
In addition: Warning message:
In setalloccol(Final, 40960) :
tl (51894) is greater than 10,000 items over-allocated (l = 21174). If you didn't set the datatable.alloccol option to be very large, please report to data.table issue tracker including the result of sessionInfo().
Is there a way to bypass this problem?
Thanks a lot
Edit:
To answer Roland's comment, here is what I am doing:
vars <- c(colnames(FinalTable_0)[271:290], colnames(FinalTable_0)[292:dim(FinalTable_0)[2]]) # <- variables I want to operate on
# FinalTable_0 is a previous table I use to collect the roots of the variables I want to work with
difference <- function(root) lapply(root, function(z) paste0("get('", z, ".x') - get('", z, ".y')"))
ratio <- function(root) lapply(root, function(z) paste0("get('", z, ".x') / get('", z, ".y')"))
# proceed to the computation
setDT(Final)[, c(paste0(vars, ".xy_diff"), paste0(vars, ".xy_ratio")) :=
               lapply(c(difference(vars), ratio(vars)), function(x) eval(parse(text = x)))]

I tried the solution proposed by Roland, but was not fully satisfied. It works, but I do not like the idea of transposing my data.
In the end, I just split the original data.table into multiple ones, ran the computations on each individually and joined them back at the end. Fast and simple: no need to fiddle with the variables, declare which ones are ids and which are measures, or reshape back and forth. I just prefer it this way.
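For reference, here is a minimal sketch of that split / compute / merge approach. It assumes a hypothetical row key row_id added to Final so the pieces can be merged back, and that every root in vars exists as a <root>.x / <root>.y column pair; the chunk size is arbitrary.
library(data.table)

## hypothetical row key so the pieces can be merged back together
Final[, row_id := .I]

## work on ~1000 roots at a time (the chunk size is arbitrary)
chunks <- split(vars, ceiling(seq_along(vars) / 1000))

pieces <- lapply(chunks, function(v) {
  ## copy out only the columns needed for this chunk
  sub <- Final[, c("row_id", paste0(v, ".x"), paste0(v, ".y")), with = FALSE]
  for (r in v) {
    sub[, (paste0(r, ".xy_diff"))  := get(paste0(r, ".x")) - get(paste0(r, ".y"))]
    sub[, (paste0(r, ".xy_ratio")) := get(paste0(r, ".x")) / get(paste0(r, ".y"))]
  }
  sub[, c("row_id", paste0(v, ".xy_diff"), paste0(v, ".xy_ratio")), with = FALSE]
})

## merge the new columns back onto the original table by the row key
Final <- Reduce(function(a, b) merge(a, b, by = "row_id"), pieces, Final)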

Related

R: Efficiently Calculate Deviations from the Mean Using Row Operations on a DF (Without Using a For Loop)

I am generating a very large data frame consisting of a large number of combinations of values. As such, my code has to be as efficient as possible, or else 1) I get errors like "R cannot allocate vector of size XX" or 2) the calculations take forever.
I am at the point where I need to calculate r deviations from the mean for each sample (one sample per row of the df; in the example below r = 3, and the deviations are labelled dev1 to dev3).
I tried this (r is the number of values in each sample, here set to 3):
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
When I try this, it does not give me what I want. I am guessing that this code attempts to calculate the difference between each row of X1 (x) and the entire vector X1$x.bar, instead of 81 for the 1st row, 81.25 for the 2nd row, etc.
Once again, I can easily do this using for loops, but I'm assuming that is not the most efficient way.
Can someone please steer me in the right direction? Any assistance is appreciated.
Here is the whole code for the small sample version with r <- 3. WARNING: this computes all possible combinations, so the data frames get very large very quickly.
options(scipen = 999)
dp <- function(x) {
  dp1 <- nchar(sapply(strsplit(sub('0+$', '', as.character(format(x, scientific = FALSE))),
                               ".", fixed = TRUE),
                      function(x) x[2]))
  ifelse(is.na(dp1), 0, dp1)
}
retain1<-function(x,minuni) length(unique(floor(x)))>=minuni
# =======================================================
r<-3
x0<-seq(80,120,.25)
X0<-data.frame(t(combn(x0,r)))
names(X0)<-paste("x",1:r,sep="")
X<-X0[apply(X0,1,retain1,minuni=r),]
rm(X0)
gc()
X$x.bar<-rowMeans(X)
dp1<-dp(X$x.bar)
X1<-X[dp1<=2,]
rm(X)
gc()
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
Because R is vectorized, you only need to subtract x.bar from x1, x2, x3 collectively:
devs <- X1[ , 1:3] - X1[ , 4]
X1devs <- cbind(X1, devs)
That's it...
I think you just got the margin wrong: in apply you are using 1, which is row-wise, but you want to go column-wise, so use 2:
X2<-apply(X1[,1:r], 2, function(x) x-X1$x.bar)
But from what I quickly searched, the apply family isn't better than loops in performance, only in clarity. Check this post: Is R's apply family more than syntactic sugar?
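If you want to check the performance claim on your own data, a rough timing sketch (the sizes here are illustrative, not a benchmark of the code above):
## compare the plain vectorised subtraction with the column-wise apply
n  <- 1e5
X1 <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
X1$x.bar <- rowMeans(X1[, 1:3])

system.time(devs_vec   <- X1[, 1:3] - X1$x.bar)                           # vectorised
system.time(devs_apply <- apply(X1[, 1:3], 2, function(x) x - X1$x.bar))  # apply, column-wise

all.equal(as.matrix(devs_vec), devs_apply, check.attributes = FALSE)      # same result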

Error with knnImputer from the DMwR Package: invalid 'times' argument

I'm trying to run knnImputation from the DMwR package on a genomic dataset. The dataset has two columns - one for location on a chromosome (numeric, an integer) and one for methylation values (also numeric, double), with many of the methylation values missing. The idea is that distance should be based on location in the chromosome. (I also have several other features, but chose not to include them.) When I run the following line, however, I get an error.
reg.knn <- knnImputation(as.matrix(testp), k=2, meth="median")
#ERROR:
#Error in rep(1, ncol(dist)) : invalid 'times' argument
Any thoughts on what could be causing this?
If this doesn't work, does anyone know of any other good KNN imputers in R packages? I've been trying several, but each returns some kind of error.
I got a similar error today:
Error in rep(1, ncol(dist)) : invalid 'times' argument
I could not find a solution online, but with some trial and error I think the issue is the number of columns in the data frame.
Try passing at least 3 columns to knnImputation.
I created a dummy column which gives the row number of each observation (as a third column).
It worked for me! (A sketch of this is shown after the examples below.)
Examples for your reference:
Example 1 -
temp <- data.frame(X = c(1,2,3,4,5,6,7,8,9,10), Y = c(T, T, F, F,F,F,NA,NA,T,T))
temp7 <- NULL
temp7 <- knnImputation(temp, scale = T, k = 3, meth = 'median', distData = NULL)
Error in rep(1, ncol(dist)) : invalid 'times' argument
Example 2 -
temp <- data.frame(X = 1:10, Y = c(T, T, F, F,F,F,NA,T,T,T), Z = c(NA,NA,7,8,9,5,11,9,9,4))
temp7 <- NULL
temp7 <- knnImputation(temp, scale = T, k = 3, meth = 'median', distData = NULL)
Here the number of columns passed is 3. I did NOT get any error!
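A minimal sketch of the dummy-column workaround, mirroring Example 1 above (the row_id name is just illustrative; note that the extra column takes part in the distance calculation, so interpret the imputed values with that in mind):
library(DMwR)   # knnImputation() is also exported by the newer DMwR2 package

temp <- data.frame(X = c(1,2,3,4,5,6,7,8,9,10),
                   Y = c(T, T, F, F, F, F, NA, NA, T, T))

## add a dummy third column holding the row number of each observation
temp$row_id <- seq_len(nrow(temp))

## with three columns present, the call should no longer error (per the observation above)
imputed <- knnImputation(temp, scale = TRUE, k = 3, meth = 'median', distData = NULL)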
Today I encountered the same error. My df had many more than 3 columns, so that seems not to be the (only?) problem.
I found that rows with too many NAs caused the problem (in my case, more than 95% of a given row was NA). Filtering out these rows solved the problem.
Take-home message: do not only filter for NAs over the columns (which I did), but also check the rows (it is of course impossible to impute by kNN if you cannot define what exactly a nearest neighbor is).
It would be nice if the package provided a readable error message!
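For example, a minimal sketch of that row filter (df and the 95% cut-off are illustrative):
## fraction of missing values per row
na_frac <- rowMeans(is.na(df))

## keep only rows that are less than 95% NA before calling knnImputation()
df_clean <- df[na_frac < 0.95, , drop = FALSE]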
When I read through the code, I located the problem: if there are fewer than 3 columns, the data get downgraded along the way to something that is not a data frame, so all the operations that rely on the data frame structure fail. I think the author should handle this case.
And yes, the earlier answer also found it by trial and error: different road, same answer.

How to quickly split values in column to create a table for plotting in R

I was wondering if anyone could offer any advice on speeding up the following in R.
I've got a table in a format like this:
chr1, A, G, v1,v2,v3;w1,w2,w3, ...
...
The header is
chr, ref, alt, sample1, sample2, ... (many samples)
In each row, for each sample, I've got 3 values for v and 3 values for w, separated by ";".
I want to extract v1 and w1 for each sample and make a table that can be plotted using ggplot; it would look like this:
chr, ref, alt, sam, v1, w1
I am doing this with strsplit and rbind, one row at a time, like the following:
varsam <- c()
for (i in 1:n.var) {
  chrm <- variants[i, 1]
  ref  <- as.character(variants[i, 3])
  alt  <- as.character(variants[i, 4])
  amp  <- as.character(variants[i, 5])
  for (j in 1:n.sam) {
    vs  <- strsplit(as.character(vcftable[i, j + 6]), split = ":")[[1]]
    vsc <- strsplit(vs[1], split = ",")[[1]]
    vsp <- strsplit(vs[2], split = ",")[[1]]
    varsam <- rbind(varsam, c(chrm, ref, alt, j, vsc[1], vsp[1]))
  }
}
This is very slow as you would expect. Any idea how to speed this up?
As noted by others, the first thing you need is some timings, so that you have something to compare against if you intend to optimize. This would be my first step:
Create some timings.
Play around with different aspects of your code to see where most of the time is being spent.
Basic timing analysis can be done with system.time() to help with the performance analysis.
Beyond that, there are some candidates you might like to consider to improve performance - but again, it is important to get the timings first so that you have something to compare against.
The dplyr library contains a mutate function which can be used to create new columns, e.g. mynewtablewithextracolumn <- mutate(table, v1 = whatever you want it to be), where v1 is the new column and you supply how each value is calculated. There are lots of examples on the internet.
In order to use dplyr, you need to call library(dplyr) in your code, and you may need install.packages("dplyr") if it is not already installed.
You might also be best converting your table into the appropriate type of table for dplyr, e.g. if your current table is a data frame, use table = tbl_df(df).
As noted, these are just some possible areas. The important thing is to get timings, explore the performance to get a handle on where best to focus, and make sure you can measure any improvement.
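For instance, a minimal timing sketch (extract_loop and extract_melt are hypothetical placeholders for the current loop-based code and a rewritten version):
## wrap each candidate approach in system.time() and compare the elapsed times
t_loop <- system.time(res_loop <- extract_loop(vcftable))
t_melt <- system.time(res_melt <- extract_melt(vcftable))
rbind(loop = t_loop, melt = t_melt)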
Thanks for the comments. I think I've found a way to improve this.
I used melt from the "reshape" package to first convert my input table to
chr, ref, alt, variable
I can then use apply to modify "variable", each row of which contains a concatenated string. This achieves good speed.
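A sketch of that melt-then-extract approach, here using data.table::melt and tstrsplit rather than the reshape package mentioned above; the table name wide, its column names and the separators are assumptions based on the description in the question:
library(data.table)
setDT(wide)   # wide: chr, ref, alt, plus one column per sample holding "v1,v2,v3;w1,w2,w3"

## one row per (variant, sample)
long <- melt(wide, id.vars = c("chr", "ref", "alt"),
             variable.name = "sam", value.name = "value")

## split the v-triplet from the w-triplet, then keep the first element of each
long[, c("v_part", "w_part") := tstrsplit(value, ";", fixed = TRUE)]
long[, v1 := tstrsplit(v_part, ",", fixed = TRUE)[[1]]]
long[, w1 := tstrsplit(w_part, ",", fixed = TRUE)[[1]]]
long[, c("value", "v_part", "w_part") := NULL]

## long now has columns chr, ref, alt, sam, v1, w1 - ready for ggplot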

R data.table replacing an index of values from another data.table

Hi, I'm still trying to figure out data.table. If I have a data.table of values such as those below, what is the most efficient way to replace the values with those from another data.table?
set.seed(123456)
a = data.table(
  date_id = rep(seq(as.Date('2013-01-01'), as.Date('2013-04-10'), 'days'), 5),
  px      = rnorm(500, mean = 50, sd = 5),
  vol     = rnorm(500, mean = 500000, sd = 150000),
  id      = rep(letters[1:5], each = 100)
)
b = data.table(
  date_id = rep(seq(as.Date('2013-01-01'), length.out = 600, by = 'days'), 5),
  id      = rep(letters[1:5], each = 600),
  px      = NA_real_,
  vol     = NA_real_
)
setkeyv(a,c('date_id','id'))
setkeyv(b,c('date_id','id'))
What I'm trying to do is replace the px and vol in b with those in a where date_id and id match. I'm a little flummoxed with this - I would suppose that something along the lines of the following might be the way to go, but I don't think it will work in practice.
b[which(b$date_id %in% a$date_id & b$id %in% a$id),list(px:=a$px,vol:=a$vol)]
EDIT
I tried the following
t = a[b,roll=T]
t[!is.na(px),list(px.1:=px,vol.1=vol),by=list(date_id,id)]
and got the error message
Error in `:=`(px.1, px) :
:= is defined for use in j only, and (currently) only once; i.e., DT[i,col:=1L] and DT[,newcol:=sum(colB),by=colA] are ok, but not DT[i,col]:=1L, not DT[i]$col:=1L and not DT[,{newcol1:=1L;newcol2:=2L}]. Please see help(":="). Check is.data.table(DT) is TRUE.
If you are wanting to replace the values within b, you can use the i. prefix. From the NEWS regarding version 1.7.10:
The prefix i. can now be used in j to refer to join inherited
columns of i that are otherwise masked by columns in x with
the same name.
b[a, `:=`(px = i.px, vol = i.vol)]
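With the example data above, a quick sanity check (it assumes, as in the example, that every (date_id, id) pair in a also exists in b):
nrow(a)                        # 500 (date_id, id) pairs to copy over
b[!is.na(px), .N]              # should also be 500 after the update join
all.equal(b[a, px], a[, px])   # the copied values match the source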
It doesn't sound like you need the roll, from your description, and it seems like you want to do this instead where you got your error:
t[!is.na(px),`:=`(px.1=px,vol.1=vol),by=list(date_id,id)]

Porting set operations from R's data frames to data tables: How to identify duplicated rows?

[Update 1: As Matthew Dowle noted, I'm using data.table version 1.6.7 on R-Forge, not CRAN. You won't see the same behavior with an earlier version of data.table.]
As background: I am porting some little utility functions to do set operations on rows of a data frame or pairs of data frames (i.e. each row is an element in a set), e.g. unique - to create a set from a list, union, intersection, set difference, etc. These mimic Matlab's intersect(...,'rows'), setdiff(...,'rows'), etc., which don't appear to have counterparts in R (R's set operations are limited to vectors and lists, but not rows of matrices or data frames). Examples of these little functions are below. If this functionality for data frames already exists in some package or base R, I'm open to suggestions.
I have been migrating these to data tables, and one necessary step in the current approach is to find duplicated rows. When duplicated() is executed, an error is returned stating that data tables must have keys. This is an unfortunate roadblock - other than setting keys, which isn't a universal solution and adds to computational costs, is there some other way to find duplicated rows?
Here is a reproducible example:
library(data.table)
set.seed(0)
x <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))
y <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))
res3 <- dt_intersect(x,y)
Yielding this error message:
Error in duplicated.data.table(z_rbind) : data table must have keys
The code works as-is for data frames, though I've named each function with the pattern dt_operation.
Is there some way to get around this issue? Setting keys only works for integers, which is a constraint I can't assume for the input data. So, perhaps I'm missing a clever way to use data tables?
Example set operation functions, where the elements of the sets are rows of data:
dt_unique <- function(x) {
  return(unique(x))
}

dt_union <- function(x, y) {
  z_rbind <- rbind(x, y)
  z_unique <- dt_unique(z_rbind)
  return(z_unique)
}

dt_intersect <- function(x, y) {
  zx <- dt_unique(x)
  zy <- dt_unique(y)
  z_rbind <- rbind(zy, zx)
  ixDupe <- which(duplicated(z_rbind))
  z <- z_rbind[ixDupe, ]
  return(z)
}

dt_setdiff <- function(x, y) {
  zx <- dt_unique(x)
  zy <- dt_unique(y)
  z_rbind <- rbind(zy, zx)
  ixRangeX <- (nrow(zy) + 1):nrow(z_rbind)
  ixNotDupe <- which(!duplicated(z_rbind))
  ixDiff <- intersect(ixNotDupe, ixRangeX)
  diffX <- z_rbind[ixDiff, ]
  return(diffX)
}
Note 1: One intended use for these helper functions is to find rows where key values in x are not among the key values in y. This way, I can find where NAs may appear when calculating x[y] or y[x]. Although this usage allows for setting of keys for the z_rbind object, I'd prefer not to constrain myself to just this use case.
Note 2: For related posts, here is a post on running unique on data frames, with excellent results for running it with the updated data.table package.
And this is an earlier post on running unique on data tables.
duplicated.data.table needs the same fix unique.data.table got [EDIT: Now done in v1.7.2]. Please raise another bug report: bug.report(package="data.table"). For the benefit of others watching, you're already using v1.6.7 from R-Forge, not 1.6.6 on CRAN.
But, on Note 1, there's a 'not join' idiom:
x[-x[y,which=TRUE]]
See also FR#1384 (New 'not' and 'whichna' arguments?) to make that easier for users, and that links to the keys that don't match thread which goes into more detail.
Update. Now in v1.8.3, not-join has been implemented.
DT[-DT["a",which=TRUE,nomatch=0],...] # old idiom
DT[!"a",...] # same result, now preferred.
