How to deal with memory issues with ctree in the party package? - r

I am using the ctree method from the party R package to generate a decision tree.
My dataset has about 22 columns and 650,000 rows of data. I allocated 10 GB of memory to my R session using the memory.limit command.
I have a 2.3 GHz i3 processor and 6 GB of RAM. What am I doing wrong here?
The error I get is:
Calloc could not allocate memory (6223507 of 8 bytes)

Ok, I finally found some time to do this. It's not too elegant, but it should work.
First, load the libraries and the function below (you'll need to install the data.table package):
library(data.table)
library(party)
WeightFunc <- function(data, DV){
  # Create a paste function in order to paste unique paths
  paste2 <- function(x) paste(x, collapse = ",")
  ignore <- DV
  # Create the unique paths
  test3 <- apply(data[setdiff(names(data), ignore)], 1, paste2)
  # Bind the unique-paths vector back to the original data
  data <- cbind(data, test3)
  # Get the values of each explaining variable per unique path
  dt <- data.table(data[setdiff(names(data), ignore)])
  dt.out <- as.data.frame(dt[, head(.SD, 1), by = test3])
  # Create dummy variables per each value of our dependent variable for further calculations
  DVLvs <- as.character(unique(data[, DV]))
  data[, DVLvs[1]] <- ifelse(data[, DV] == DVLvs[1], 1, 0)
  data[, DVLvs[2]] <- ifelse(data[, DV] == DVLvs[2], 1, 0)
  data[, DVLvs[3]] <- ifelse(data[, DV] == DVLvs[3], 1, 0)
  # Sum the dummy variables per unique path
  dt <- data.table(data[c("test3", DVLvs)])
  dt.out2 <- as.data.frame(dt[, lapply(.SD, sum), by = test3])
  # Bind the unique paths with the sums
  dt.out2$test3 <- dt.out$test3 <- NULL
  test <- cbind(dt.out, dt.out2)
  # Duplicate the data in order to create a weight for every level of the explained variable
  test2 <- test[rep(1:nrow(test), each = 3), ]
  test2 <- cbind(test2, AdjDV = DVLvs)
  test2$Weights <- ifelse(is.element(seq(1:nrow(test2)), grep("[.]1", rownames(test2))), test2[, DVLvs[2]],
                   ifelse(is.element(seq(1:nrow(test2)), grep("[.]2", rownames(test2))), test2[, DVLvs[3]], test2[, DVLvs[1]]))
  # Delete the now-unnecessary columns
  test2[, DVLvs[1]] <- test2[, DVLvs[2]] <- test2[, DVLvs[3]] <- NULL
  return(test2)
}
Now run this function on your data set, where data is your data and DV is your explained variable name (in quotes), and save the result in a new dataset. For example:
Newdata <- WeightFunc(data = Mydata, DV = "Success")
Now, this process could take a while if you have many unique paths, but it shouldn't overload your memory. If you don't have too many unique paths, this function should reduce your data set by tens or even hundreds of times. Also, this function is only good for a 3-level factor explained variable (like you have).
After that, you can run ctree as you did previously, but with the new data, the new explained variable (which will be called AdjDV) and the weights parameter, which is called Weights. You'll also have to exclude Weights from the dataset while running ctree.
Like that:
ct <- ctree(AdjDV ~., data = Newdata[setdiff(names(Newdata), "Weights")], weights = Newdata$Weights)
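To see what this buys you, here is a quick sanity check on a small simulated dataset (hypothetical data with a 3-level Success variable, not from the question; just to illustrate the compression):
# Hypothetical toy data, only to illustrate the idea
set.seed(42)
Mydata <- data.frame(x1 = factor(sample(letters[1:3], 1000, replace = TRUE)),
                     x2 = factor(sample(LETTERS[1:2], 1000, replace = TRUE)),
                     Success = sample(c("yes", "no", "maybe"), 1000, replace = TRUE))
Newdata <- WeightFunc(data = Mydata, DV = "Success")
nrow(Newdata)        # 3 rows per unique x1/x2 path, so at most 18 here
sum(Newdata$Weights) # 1000, i.e. nrow(Mydata): no observations are lost
Depending on your R version, you may need Newdata$AdjDV <- factor(Newdata$AdjDV) before calling ctree.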

Related

How to efficiently split each row into test and train subsets using R?

I have a data table that provides the length and composition of given vectors. For example:
library(data.table)
set.seed(1)
dt = data.table(length = c(100, 150),
                n_A = c(30, 30),
                n_B = c(20, 100),
                n_C = c(50, 20))
I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:
dt_80_list <- list() # create output lists
dt_20_list <- list()
for (i in 1:nrow(dt)){ # for each row in the data.table
  # create a randomised vector with the given number of each component
  sample_vec <- sample( c( rep("A", dt$n_A[i]),
                           rep("B", dt$n_B[i]),
                           rep("C", dt$n_C[i]) ) )
  sample_vec_80 <- sample_vec[1:floor(length(sample_vec)*0.8)] # subset 80% of the vector
  # count the number of each component in the subset and output to list
  dt_80_list[[i]] <- data.table( length = length(sample_vec_80),
                                 n_A = length(sample_vec_80[which(sample_vec_80 == "A")]),
                                 n_B = length(sample_vec_80[which(sample_vec_80 == "B")]),
                                 n_C = length(sample_vec_80[which(sample_vec_80 == "C")])
                                 )
  # subtract the counts in the 80% subset from the totals to identify the 20% subset
  dt_20_list[[i]] <- data.table( length = dt$length[i] - dt_80_list[[i]]$length,
                                 n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
                                 n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
                                 n_C = dt$n_C[i] - dt_80_list[[i]]$n_C
                                 )
}
dt_80 <- do.call("rbind", dt_80_list) # collapse lists to output data.tables
dt_20 <- do.call("rbind", dt_20_list)
However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?
Thanks.
(I assumed your dataset consists of many more rows, but only a few columns.)
Here's a version I came up with, with mainly three changes:
- use .N and by= to count the number of "A", "B", "C" drawn in each row
- use the size argument in sample
- join the original dt and dt_80 to calculate dt_20 without a for loop
## draw training data
dt_80 <- dcast(
  dt[, row := 1:nrow(dt)
     ][, .(draw = sample(c(rep("A80", n_A),
                           rep("B80", n_B),
                           rep("C80", n_C)),
                         size = .8 * length))
       , by = row
       ][, .N,
         by = .(row, draw)],
  row ~ draw, value.var = "N")[, length80 := A80 + B80 + C80]
## draw test data
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80), on = "row"][, length20 := A20 + B20 + C20]
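A quick check that nothing is lost in the split (my addition; it assumes dt, dt_80 and dt_20 from above and that each letter is drawn at least once per row, so dcast produces no NAs):
stopifnot(all(dt_80$A80 + dt_20$A20 == dt$n_A),
          all(dt_80$B80 + dt_20$B20 == dt$n_B),
          all(dt_80$C80 + dt_20$C20 == dt$n_C),
          all(dt_80$length80 + dt_20$length20 == dt$length))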
There is probably still room for optimization, but I hope it already helps :)
EDIT
Here I add my initial idea. I did not post it at first because the code above is much faster, but this one might be more memory-efficient, which seems crucial in your case. So even if you already have a working solution, this might be of interest...
library(data.table)
library(Rfast)
## add row numbers
dt[, row := 1:nrow(dt)]
## sampling function
sampfunc <- function(n_A, n_B, n_C){
  draw <- sample(c(rep("A80", n_A),
                   rep("B80", n_B),
                   rep("C80", n_C)),
                 size = .8*(n_A + n_B + n_C))
  out <- Rfast::Table(draw)
  return(as.list(out))
}
## draw training data
dt_80 <- dt[, sampfunc(n_A, n_B, n_C), by = row]
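The 20% test counts then follow by subtraction, just as above (a sketch; it assumes Rfast::Table names the counts "A80", "B80" and "C80", which holds here since those are the sampled labels):
## derive the 20% test counts from the totals and the training draw
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80), on = "row"]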

Loop through (subsets) using jags

I have a big data frame with 10000 rows and 12 columns (discountdataset).
The columns contain different variables. The first 210 rows represent subject 1 (there is also a column with "subject1"), the next 210 rows represent subject 2, and so on.
I want to use JAGS and a loop function to loop through all 52 subjects in the data frame and assign a function to each of them. My code looks like this:
#subsetting the dataframe by the variable subjectid
subsetdiscount <- split(discountdataset, as.factor(discountdataset$subjectid))
Here my plan is to loop and assign the following jags function to all subjects in the subset, but it doesn't work. I think my mistake is that the variables "nt" and "Choice" that I want to pass on to JAGS are not defined right, or are not updated.
library(rjags)
for (i in 1:length(subsetdiscount))
{
  nt <- nrow(subsetdiscount)
  Choice <- subsetdiscount$choice
  amountSS <- subsetdiscount$val_basic
  amountLL <- subsetdiscount$val_d
  delayDIFF <- subsetdiscount$delay
  con <- subsetdiscount$condition
  data <- list("nt", "Choice", "amountSS", "amountLL", "delayDIFF", "con") # to be passed on to JAGS
  myinits <- list(
    list(k = (c(0.01, 0.01))),
    list(temp = (c(6, 6))))
  parameters <- c("k", "temp")
  samples <- jags(data, inits=myinits, parameters,
                  model.file ="singlesubmodel_Ben_roundedchoice.txt", n.chains=2, n.iter=20000,
                  n.burnin=1, n.thin=1, DIC=T)
}
Try:
library(rjags)
library(R2jags)
subsetdiscount <- split(discountdataset, as.factor(discountdataset$subjectid))
output_models <- lapply(subsetdiscount, function(x) {
  nt <- nrow(x)
  Choice <- x$choice
  amountSS <- x$val_basic
  amountLL <- x$val_d
  delayDIFF <- x$delay
  con <- x$condition
  data <- list("nt", "Choice", "amountSS", "amountLL", "delayDIFF", "con") # to be passed on to JAGS
  myinits <- list(list(k = (c(0.01, 0.01))),
                  list(temp = (c(6, 6))))
  parameters <- c("k", "temp")
  samples <- jags(data, inits=myinits, parameters,
                  model.file ="singlesubmodel_Ben_roundedchoice.txt",
                  n.chains=2, n.iter=20000,
                  n.burnin=1, n.thin=1, DIC=T)
  return(samples)
})
output_models should be a list containing the output for each of the factors you split the main dataset by.
Please note that it is quite hard to test this without any provided data. So, if this fails to work, you may want to provide some data for testing.
I hope it helps.
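If it runs, you can post-process the list as usual. For example, a sketch (assuming jags() comes from R2jags, whose return value carries a BUGSoutput element) of collecting the posterior means of k per subject:
## posterior means of 'k' for every subject, one column per subject
k_means <- sapply(output_models, function(m) m$BUGSoutput$mean$k)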

Evaluate code within a function call in R (Use ICC::ICCbare within a loop)

I want to use the ICC::ICCbare function within a loop. However, ICCbare uses the concrete variable names as input, e.g.:
ICCbare(x = group, y = variable1, data = dat)
whereby both "group" and "variable1" are columns of the data.frame "dat" (i.e., dat$variable1); ICCbare cannot be used with y = dat[, i].
In order to program a loop I therefore need to evaluate some R code within the function call of ICCbare. My idea was the following:
for(i in 1:10){
ICCbare(group, names(dat)[i], data = dat)
}
However, this does not work. The following error is printed:
Error in `[.data.frame`(data, yc) : undefined columns selected
Is there a way to evaluate the statement names(dat)[i] first, before it is passed to the function call?
Here is a minimum working example for my problem:
# Create data set
dat <- data.frame(group = c(rep("A", 5),
                            rep("B", 5)),
                  variable1 = 1:10,
                  variable2 = rnorm(10))
# Loop
for (i in names(dat)[2:3]){
  ICCbare("group", i, data = dat)
}
I agree with @agstudy. This is a bad example of non-standard evaluation. You can use this as a workaround:
v <- "variable1"
ICCbare("group", v, data = dat)
#Error in `[.data.frame`(data, yc) : undefined columns selected
eval(bquote(ICCbare("group", .(v), data = dat)))
#$ICC
#[1] 0.8275862
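Wrapped in your loop, the workaround looks like this (a sketch):
for (v in names(dat)[2:3]) {
  print(eval(bquote(ICCbare("group", .(v), data = dat))))
}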
It is a bug in ICCbare, which tries to manage arguments as names in a bad manner:
function (x, y, data)
{
  ICCcall <- Call <- match.call()
  xc <- as.character(ICCcall[[2L]]) ## this is ugly!
  yc <- as.character(ICCcall[[3L]])
  inds <- unique(data[xc])[[1]]
  tdata <- data.frame(data[yc], data[xc])
  ## ... (rest of the function)
Personally, I would remove the first lines and just assume that the arguments are column names:
ICCbare_simple <- function (xc, yc, data)
{
  ## remove lines before this one
  inds <- unique(data[xc])[[1]]
  ## the rest of the code
  .....
}
I'm the maintainer of ICC and I want to thank you for the excellent discussion. I know this is a very late reply, but I just updated the package and the new version (v2.3.0) should fix the "ugly" code and the problem encountered by the OP. See examples in this gist.
I just wanted to post this here in case anyone was searching with a similar problem. Thanks again, sorry for the delay.
Here is the content of the gist:
ICC non-standard evaluation examples
The ICC package for R calculates the intraclass correlation coefficient (ICC) from a one-way analysis of variance. Recently, the package was updated to better execute R's non-standard evaluation within each function (version 2.3.0 and higher). The package functions should now be able to handle a range of possible scenarios for calling the functions in, what I hope, is a less grotesque and more standard way of writing R functions. To demonstrate, below are some of those scenarios. Note, the examples use the ICCbare function, but the way in which the function arguments are supplied will apply to all of the functions in ICC.
First, load the package (and make sure the version is >= 2.3.0):
library(ICC)
packageVersion("ICC")
Columns of a data.frame
Here we supply the column names and the data.frame that contains the data to calculate the ICC. We will use the ChickWeight data frame.
data(ChickWeight)
ICCbare(x = Chick, y = weight, data = ChickWeight)
#$ICC
#[1] 0.1077609
Iterating through columns of a data.frame
In this case, we might have a data.frame in which we want to estimate the ICC for a number of different types of measurements that each has the same grouping or factor variable (e.g., x). The extreme of this might be in a simulation or bootstrapping scenario or even with some fancy high-throughput phenotyping/data collection. The point being, we want to automate the calculation of the ICC for each column.
First, we will simulate our own dataset with 3 traits to use in the example:
set.seed(101)
n <- 15     # number of individuals/groups/categories/factors
k <- 3      # number of measures per 'n'
va <- 1     # variance among
icc <- 0.6  # expected ICC
vw <- (va * (1 - icc)) / icc # solve for variance within
simdf <- data.frame(ind = rep(LETTERS[1:n], each = k),
                    t1 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
                    t2 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
                    t3 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)))
Two ways to run through the columns come to mind: iteratively pass the name of each column or iteratively pass the column index. I will demonstrate both below. I do these in for loops so it is easier to see, but an easy extension would be to vectorise this by using something from the apply family of functions. First, passing the name:
for(i in names(simdf)[-1]){
  cat(i, ":")
  tmp.icc <- ICCbare(x = ind, y = i, data = simdf)
  cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
or even like this:
for(i in 1:3){
  cat(paste0("t", i), ": ")
  tmp.icc <- ICCbare(x = ind, y = paste0("t", i), data = simdf)
  cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
Alternatively, pass the column index:
for(i in 2:ncol(simdf)){
  cat(names(simdf)[i], ": ")
  tmp.icc <- ICCbare(x = ind, y = simdf[, i], data = simdf)
  cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
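As mentioned above, the for loops can also be replaced by something from the apply family, e.g. (a sketch under the same assumptions as the loops above):
## returns a named vector of the three ICCs reported above
sapply(names(simdf)[-1], function(i) ICCbare(x = ind, y = i, data = simdf))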
Passing a character as an argument is deprecated
Note that the function will still work if a character is passed directly (e.g., "t1"), albeit with a warning. The warning just means that this may no longer work in future versions of the package. For example:
ICCbare(x = ind, y = "t1", data = simdf)
#[1] 0.60446
#Warning message:
#In ICCbare(x = ind, y = "t1", data = simdf) :
# passing a character string to 'y' is deprecated since ICC version
# 2.3.0 and will not be supported in future versions. The argument
# to 'y' should either be an unquoted column name of 'data' or an object
Note, however, that an expression evaluating to a character (e.g., paste0("t", 1)) doesn't throw the warning, which is nice!

gene ranking microarray

I wonder if anyone is familiar with the Bioconductor RankProd package for ranking and obtaining differentially expressed genes. Some info about the software is available here: paper, manual, documentation.
I ran into some problems while using the program, maybe because of my limited knowledge of the R language. I tried to replicate the steps in the PDF files above with my own data. My datasets were not in the affy .cel format used in the examples, but only rows and columns in a tab-delimited file. I have two conditions (1 and 2, with 4 replicates each).
Here is my code:
library(RankProd)
library(preprocessCore)
# Read expression data
#gdata <- read.table(file="data2.txt", sep="\t", header=T) # 9000 rows of genes X 8 columns of chips
gdata <- read.table(file="data2.txt", sep="\t", header=T, row.names=1) # 9000 rows of genes X 8 columns of chips
#colnames(gdata)
# This vector contains the microarray sample names
SampleNames <- names(data.frame(gdata[,-1]))
#names(datExpr) <- gdata[,1]
# This vector contains the gene names
datExpr.gnames <- gdata$GeneName
# Since the first column contains the gene names, exclude it.
# datExpr is then the matrix required
datExpr <- data.frame(gdata[,-1])
# Convert data into matrix form
datExpr <- as.matrix(datExpr)
# Data normalization - quantile normalization
#datExpr.log.norm <- normalize.quantiles((log2(datExpr)),copy=TRUE) #with logged data
datExpr <- datExpr.log.norm
#datExpr.norm <- normalize.quantiles(datExpr,copy=TRUE) #without logged data
#datExpr <- datExpr.norm
# Identify two-class data - control/treatment (or condition 1/condition 2)
nl <- 4
n2 <- 4
cl <- rep(c(0, 1), c(nl, n2))
datExpr.cl <- cl
# Data were generated under identical or very similar conditions except for the
# factor of interest (e.g., control and treatment)
origin <- rep(1, nl + n2)
datExpr.origin <- origin
# Data analysis
datExpr.sub <- datExpr[, which(datExpr.origin == 1)]
datExpr.cl.sub <- datExpr.cl[which(datExpr.origin == 1)]
datExpr.origin.sub <- datExpr.origin[which(datExpr.origin == 1)]
# Rank product analysis and output
#RP.out <- RP(datExpr.sub, datExpr.cl.sub, num.perm = 100, logged = TRUE, na.rm = FALSE, plot = FALSE, rand = 123)
RP.out <- RPadvance(datExpr.sub, datExpr.cl.sub, datExpr.origin.sub, num.perm = 100, logged = TRUE,
                    na.rm = FALSE, gene.names = datExpr.gnames, plot = FALSE, rand = 123)
# Output a table of the identified genes based on user-specified selection criteria
topGene(RP.out, cutoff = 0.05, method = "pfp", logged = TRUE, logbase = 2, gene.names = datExpr.gnames)
I did run the code, but my fold changes for differentially expressed genes in one condition vs. the other were either 0 or infinite. I wonder if anyone with experience with this program can help me.
At first glance, what I note is that
#datExpr.log.norm <- normalize.quantiles((log2(datExpr)),copy=TRUE) #with logged data
datExpr <- datExpr.log.norm
Here, as long as the first line is commented out, datExpr.log.norm is never created, so datExpr ends up empty (the assignment fails with an "object not found" error, or silently picks up a stale object from the workspace).
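So a minimal fix (a sketch) is simply to keep the normalisation line active before the assignment:
# with logged data: create datExpr.log.norm before assigning it to datExpr
datExpr.log.norm <- normalize.quantiles(log2(datExpr), copy = TRUE)
datExpr <- datExpr.log.norm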

Subsetting data by condition

I am trying to reshape/reduce my data. So far, I have employed a for loop (very slow), but from what I perceive, this should be quite fast with plyr.
I have many groups (firms, as a factor in the dataset) and I want to drop entirely every firm that shows a 0 entry for value in any of that firm's cells. I thus create a new data.frame but leave out all groups showing 0 for value at some point.
The for loop:
Data creation:
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
                   value = rpois(40, 2))
-----------------------------
splitby = mydf$firmname
new.data <- data.frame()
for (i in 1:(length(unique(splitby)))) {
  temp <- subset(mydf, splitby == as.character(paste(unique(splitby)[i])))
  if (all(temp$value > 0) == "TRUE") {
    new.data <- rbind(new.data, temp)
  }
}
Drop all now-empty firm factor levels (note: the kept column is firmname, not splitby):
new.data$firmname <- factor(new.data$firmname)
Is there a way to achieve that with the plyr package? Can the subset function be used in that context?
EDIT: For the purpose of reproducing the problem, the data creation step, as suggested by BenBarnes, has been added. Ben, thanks a lot for that. Furthermore, my code has been altered so as to comply with the answers provided below.
You could supply an anonymous function to the .fun argument in ddply():
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
                   value = rpois(40, 2))
library(plyr)
ddply(mydf, .(firmname), function(x) if(any(x$value == 0)) NULL else x)
Or using [, as suggested by Andrie:
firms0 <- unique(mydf$firmname[which(mydf$value == 0)])
mydf[-which(mydf$firmname %in% firms0), ]
Note that the results of ddply are sorted according to firmname.
EDIT
For the example in your comments, this approach is again faster than using ddply() to subset, selecting only firms with more than three entries:
firmTable <- table(mydf$firmname)
firmsGT3 <- names(firmTable)[firmTable > 3]
mydf[mydf$firmname %in% firmsGT3, ]
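For completeness, a base-R alternative using ave() (my addition, not from the answers above) drops every firm with at least one zero value, without any split or join:
## flag rows whose firm has any value == 0, then keep the rest
has_zero <- as.logical(ave(mydf$value == 0, mydf$firmname, FUN = any))
mydf[!has_zero, ]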
