I would like to run a Dirichlet regression on a large data set using the DirichReg package in R. I currently have a data.frame with 37 columns and ~13,000,000 rows.
However, running this model on all of my data instantly crashes R. I am using a Linux machine with 16 cores and 128 GB of memory. Even just cutting down my data to only 1000 points still causes R to almost immediately crash and restart.
Am I doing something wrong? Is there any way I can parallelize this operation to get this model to run?
I am running a model with the following syntax:
data.2 <- data
data.2$y_variable <- DR_data(data[,c(33:35)])
model <- DirichReg(y_variable ~ x_variable, data.2)
I have to create the y_variable in a separate data.2 data.frame, because running data$y_variable <- DR_data(data[,c(33:35)]) will crash R. I have no idea why this is.
It's a bit of a guess as to why it's 'crashing' R, but if it's due to RAM issues then you can update the table by reference with data.table, rather than copying the entire data set:
library(data.table)
setDT(data)
data[, y_variable := DR_data(data[, c(33:35)])]
I have a large dataset, including about 100,000 entries. I am using the tibbletime package to create a rolling version of the DL.test function from the vrtest package.
I am using a rolling window (size=1000), leading to about 99,000 computations. The code looks like this:
#installing packages
install.packages("tibbletime")
install.packages("vrtest")
#importing libraries
library(vrtest)
library(dplyr)
library(tibbletime)
library(tibble)
#generating demo data
data <- data.frame(log_return = sample(0:1, 1010, replace = TRUE))
#running DL.test once
DL.test(data, 300, 1)
#creating a rolling window version of DL.test
test <- rollify(DL.test, window=1000, unlist=FALSE)
#applying function and saving results
results <- dplyr::mutate(data, test = test(log_return))
The issue now is that running DL.test even once takes a little less than 5 minutes on my current setup. Having to repeat this step nearly 100,000 times severely limits the practicality of this approach.
What options do I have to speed this process up?
My current idea would be to create many smaller versions of my original dataset (e.g., entries 1 - 1500 for the first 500 computations, 501 - 2000 for the second batch...) and somehow employ parallel processing.
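Something along these lines is what I have in mind; this is only an untested sketch using the parallel package, where the batch size, the core count, and passing the raw log_return vector straight to DL.test (mirroring the single call above) are all assumptions on my part:
library(parallel)
library(vrtest)
window    <- 1000                                  # rolling window size
per.chunk <- 500                                   # number of windows handled per batch
starts    <- seq(1, nrow(data) - window + 1, by = per.chunk)
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(vrtest))
clusterExport(cl, c("data", "window", "per.chunk"))
# each worker computes one batch of consecutive rolling windows
results <- parLapply(cl, starts, function(s) {
  last <- min(s + per.chunk - 1, nrow(data) - window + 1)
  lapply(s:last, function(i) DL.test(data$log_return[i:(i + window - 1)], 300, 1))
})
stopCluster(cl)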
Any hints are highly appreciated!
I am trying to read a dataset in SVMLight format into h2o. Writing it to a file on disk and reading it back is working OK but reading it directly from R's memory is not. I would like to know if there is a different function or a different way of calling the function I have used below.
Here's an example using R 3.3.3 and h2o 3.10.3.6:
require(data.table)
require(h2o)
set.seed(1000)
tot_obs <- 100
tot_var <- 500
vars_per_obs <- round(.0*tot_var,0):round(.1*tot_var,0)
#randomly generated data
mat.dt <- do.call('rbind', lapply(1:tot_obs, function(n) {
  nvar <- sample(vars_per_obs, 1)
  if (nvar > 0) data.table(obs=n, var=sample(1:tot_var, nvar))[, value:=sample(10:50, .N, replace=TRUE)]
}))
event.dt <- data.table(obs=1:tot_obs)[, is_event:=sample(0:1,.N,prob=c(.9,.1),replace=TRUE)]
#SVMLight format
setorder(mat.dt, obs, var)
mat.agg.dt <- mat.dt[, .(feature=paste(paste0(var,":",value), collapse=" ")), obs]
mat.agg.dt <- merge(event.dt, mat.agg.dt, by="obs", sort=FALSE, all.x=TRUE)
mat.agg.dt[is.na(feature), feature:=""]
mat.agg.dt[, svmlight:=paste(is_event,feature)][, c("obs","is_event","feature"):=NULL]
fwrite(mat.agg.dt, file="svmlight.txt", col.names=FALSE)
#h2o
localH2o <- h2o.init(nthreads=-1, max_mem_size="4g")
h2o.no_progress()
#works
h2o.orig <- h2o.importFile("svmlight.txt", parse=TRUE)
#does NOT work
tmp <- as.h2o(mat.agg.dt)
h2o.orig.1 <- h2o.parseRaw(tmp, parse_type="SVMLight")
The easy answer is that you probably don't have enough R memory to perform this action, so one solution is to increase the amount of memory in R (if that's an option for you). It could also mean that you don't have enough memory in your H2O cluster, so you could increase that as well.
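For instance, restarting the local H2O cluster with a larger heap might look like this (the 8g value is just an example, not a recommendation):
h2o.shutdown(prompt = FALSE)                              # shut down the current local cluster
localH2o <- h2o.init(nthreads = -1, max_mem_size = "8g")  # restart it with more memory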
The only way to go directly from R memory to the H2O cluster is the as.h2o() function, so you are definitely using the right command. Under the hood, the as.h2o() function writes the frame from R memory to disk (stored in a temp file) and then reads it directly into the H2O cluster using H2O's native parallel read functionality.
We recently added the ability to use data.table's read/write functionality any place that we use base R, so since you have data.table installed, you should probably be able to get around this bottleneck by adding this to the top of your script: options("h2o.use.data.table"=TRUE). This will force the use of data.table instead of base R to write to disk for the first half of the as.h2o() conversion process. This should work for you since it's doing the exact same thing that your code is doing already where you use fwrite to write to disk and h2o.importFile() to read it back in.
Also you don't need the last line with h2o.parseRaw():
tmp <- as.h2o(mat.agg.dt)
h2o.orig.1 <- h2o.parseRaw(tmp, parse_type="SVMLight")
You can just do:
h2o.orig.1 <- as.h2o(mat.agg.dt)
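Putting the two suggestions together, a sketch would be:
options("h2o.use.data.table" = TRUE)   # have as.h2o() write its temp file with data.table's fwrite
h2o.orig.1 <- as.h2o(mat.agg.dt)       # H2O then parses that temp file in parallel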
There is a related post that shows how to use data.table to solve the reverse problem (using as.data.frame() instead of as.h2o()) here.
I am using h2o to carry out some modelling, and having tuned the model, I would now like to use it to carry out a large number of predictions: approximately 6 billion rows, each prediction row needing 80 columns of data.
I have already broken the input dataset down into about 500 chunks of 12 million rows, each with the relevant 80 columns of data.
However, uploading a 12-million-row by 80-column data.table to h2o takes quite a long time, and doing it 500 times is prohibitively slow... I think it's because the object is parsed first before it is uploaded.
The prediction part is relatively quick in comparison....
Are there any suggestions to speed this part up? Would changing the number of cores help?
Below is a reproducible example of the issue...
# Load libraries
library(h2o)
library(data.table)
# start up h2o using all cores...
localH2O = h2o.init(nthreads=-1,max_mem_size="16g")
# create a test input dataset
temp <- CJ(v1=seq(20),
           v2=seq(7),
           v3=seq(24),
           v4=seq(60),
           v5=seq(60))
temp <- do.call(cbind,lapply(seq(16),function(y){temp}))
colnames(temp) <- paste0('v',seq(80))
# this is the part that takes a long time!!
system.time(tmp.obj <- as.h2o(localH2O,temp,key='test_input'))
#|======================================================================| 100%
# user system elapsed
#357.355 6.751 391.048
Since you are running H2O locally, you want to save that data as a file and then use:
h2o.importFile(localH2O, file_path, key='test_input')
This will have each thread read its part of the file in parallel. If you run H2O on a separate server, then you would need to copy the data to a location that the server can read from (most people don't set their servers to read from the file system on their laptops).
as.h2o() serially uploads the file to H2O. With h2o.importFile(), the H2O server finds the file and reads it in parallel.
It looks like you are using version 2 of H2O. The same commands will work in H2Ov3, but some of the parameter names have changed a little. The new parameter names are here: http://cran.r-project.org/web/packages/h2o/h2o.pdf
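For reference, a rough sketch of the same disk-based route in H2Ov3 syntax (the file path and destination_frame name here are placeholders):
fwrite(temp, "test_input.csv")                     # fast write to disk with data.table
tmp.obj <- h2o.importFile("test_input.csv", destination_frame = "test_input")   # parallel parse in H2O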
Having also struggled with this problem, I did some tests and found that for objects in R memory (i.e. you don't have the luxury of already having them available in .csv or .txt form), by far the quickest way to load them (~21 x) is to use the fwrite function in data.table to write a csv to disk and read it using h2o.importFile.
The four approaches I tried:
1. Direct use of as.h2o()
2. Writing to disk using write.csv(), then loading with h2o.importFile()
3. Splitting the data in half, running as.h2o() on each half, then combining with h2o.rbind()
4. Writing to disk using fwrite() from data.table, then loading with h2o.importFile()
I performed the tests on a data.frame of varying size, and the results seem pretty clear.
The code, if anyone is interested in reproducing, is below.
library(h2o)
library(data.table)
h2o.init()
testdf <-as.data.frame(matrix(nrow=4000000,ncol=100))
testdf[1:1000000,] <-1000 # R won't let me assign the whole thing at once
testdf[1000001:2000000,] <-1000
testdf[2000001:3000000,] <-1000
testdf[3000001:4000000,] <-1000
resultsdf <-as.data.frame(matrix(nrow=20,ncol=5))
names(resultsdf) <-c("subset","method 1 time","method 2 time","method 3 time","method 4 time")
for (i in 1:20) {
  subdf <- testdf[1:(200000*i),]
  resultsdf[i,1] <- 200000*i
  # 1: use as.h2o()
  start <- Sys.time()
  as.h2o(subdf)
  stop <- Sys.time()
  resultsdf[i,2] <- as.numeric(stop) - as.numeric(start)
  # 2: use write.csv() then h2o.importFile()
  start <- Sys.time()
  write.csv(subdf, "hundredsandthousands.csv", row.names=FALSE)
  h2o.importFile("hundredsandthousands.csv")
  stop <- Sys.time()
  resultsdf[i,3] <- as.numeric(stop) - as.numeric(start)
  # 3: split the dataset in half, load both halves, then merge
  start <- Sys.time()
  length_subdf <- dim(subdf)[1]
  h2o1 <- as.h2o(subdf[1:(length_subdf/2),])
  h2o2 <- as.h2o(subdf[(1 + length_subdf/2):length_subdf,])
  h2o.rbind(h2o1, h2o2)
  stop <- Sys.time()
  resultsdf[i,4] <- as.numeric(stop) - as.numeric(start)
  # 4: use fwrite() then h2o.importFile()
  start <- Sys.time()
  fwrite(subdf, file="hundredsandthousands.csv", row.names=FALSE)
  h2o.importFile("hundredsandthousands.csv")
  stop <- Sys.time()
  resultsdf[i,5] <- as.numeric(stop) - as.numeric(start)
  # plot the timings collected so far
  plot(resultsdf[,1], resultsdf[,2], xlim=c(0,4000000), ylim=c(0,900), xlab="rows", ylab="time/s",
       main="Scaling of different methods of h2o frame loading")
  for (j in 1:3) {
    points(resultsdf[,1], resultsdf[,(j+2)], col=j+1)
  }
  legendtext <- c("as.h2o", "write.csv then h2o.importFile", "Split in half, as.h2o and rbind", "fwrite then h2o.importFile")
  legend("topleft", legend=legendtext, col=c(1,2,3,4), pch=1)
  print(resultsdf)
  flush.console()
}
I have a large data.frame of 20M lines. This data frame is not only numeric; there are character columns as well. Using a split-and-conquer approach, I want to split this data frame so it can be processed in parallel using the snow package (the parLapply function, specifically). The problem is that the nodes run out of memory because the data frame parts are processed in RAM. I looked for a package to help with this problem and found just one that handles multi-type data.frames: the ff package. Another problem comes with the use of this package: the result of splitting an ffdf is not the same as splitting a common data.frame, so it is not possible to run the parLapply function.
Do you know of other packages for this goal? bigmemory only supports matrices.
I've benchmarked some ways of splitting the data frame and parallelizing to see how effective they are with large data frames. This may help you deal with the 20M line data frame and not require another package.
The results are here. The description is below.
This suggests that for large data frames the best option is (not quite the fastest, but has a progress bar):
library(doSNOW)
library(itertools)
library(parallel)   # for makePSOCKcluster()
library(tictoc)     # for tic()/toc()
# if size on cores exceeds available memory, increase the chunk factor
chunk.factor <- 1
chunk.num <- kNoCores * chunk.factor
tic()
# init the cluster
cl <- makePSOCKcluster(kNoCores)
registerDoSNOW(cl)
# init the progress bar
pb <- txtProgressBar(max = 100, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)
# conduct the parallelisation
travel.queries <- foreach(m = isplitRows(coord.table, chunks = chunk.num),
                          .combine = 'cbind',
                          .packages = c('httr', 'data.table'),
                          .export = c("QueryOSRM_dopar", "GetSingleTravelInfo"),
                          .options.snow = opts) %dopar% {
  QueryOSRM_dopar(m, osrm.url, int.results.file)
}
# close progress bar
close(pb)
# stop cluster
stopCluster(cl)
toc()
Note that:
coord.table is the data frame/table
kNoCores (= 25 in this case) is the number of cores
The options I compared were:
1. Distributed memory. Sends coord.table to all nodes.
2. Shared memory. Shares coord.table with nodes.
3. Shared memory with cuts. Shares a subset of coord.table with nodes.
4. Do par with cuts. Sends a subset of coord.table to nodes.
5. SNOW with cuts and progress bar. Sends a subset of coord.table to nodes.
6. Option 5 without progress bar.
More information about the other options I compared can be found here.
Some of these answers might suit you, although they don't relate to a distributed parLapply, and I've included some of them in my benchmarking options.
I have attempted to email the author of this package without success; I'm just wondering if anybody else has experienced this.
I am having an issue using rpart on 4000 rows of data with 13 attributes.
I can run the same test on 300 rows of the same data with no issue.
When I run on 4000 rows, Rgui.exe runs consistently at 50% CPU and the UI hangs; it will stay like this for at least 4-5 hours if I let it run, and never exits or becomes responsive.
Here is the code I am using on both the 300- and 4000-row subsets:
train <- read.csv("input.csv", header=T)
y <- train[, 18]
x <- train[, 3:17]
library(rpart)
fit <- rpart(y ~ ., x)
Is this a known limitation of rpart, or am I doing something wrong? Are there any potential workarounds?
Can you reproduce the problem when you feed rpart random data of similar dimensions, rather than your real data (from input.csv)? If not, it's probably a problem with your data (formatting, perhaps?). After importing your data using read.csv, check the data for format issues by looking at the output of str(train).
# How to do an equivalent rpart fit on some random data of equivalent dimension
dats <- data.frame(matrix(rnorm(4000*14), nrow=4000))
y <- dats[, 1]
x <- dats[, -1]
library(rpart)
system.time(fit <- rpart(y ~ ., x))
The problem here was a data prep error: a header row had been re-written far down in the middle of the data set.
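For anyone who hits the same symptom, a quick (unverified) way to look for a stray header row after read.csv:
str(train)   # columns that should be numeric but come back as character/factor are a red flag
# rows whose values match the column names are likely repeated header lines
which(apply(train, 1, function(r) any(r %in% names(train))))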