Memory error when iterating over two dataloaders simultaneously in PyTorch - out-of-memory

I am trying to train my model using 2 dataloaders from 2 different datasets.
Because my datasets are not the same length, I found how to set this up using cycle() and zip() from here: How to iterate over two dataloaders simultaneously using pytorch? However, during training I get the following error:
File "/home/Desktop/example/train.py", line 229, in train_2
for i, (x1, x2) in enumerate(zip(cycle(train_loader_1), train_loader_2)):
File "/home/.conda/envs/3dcnn/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 346, in __next__
data = self.dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/.conda/envs/3dcnn/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/.conda/envs/3dcnn/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 80, in default_collate
return [default_collate(samples) for samples in transposed]
File "/home/.conda/envs/3dcnn/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 80, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/home/.conda/envs/3dcnn/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 56, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 154140672 bytes. Error code 12 (Cannot allocate memory)
I tried to solve this by setting num_workers=0, decreasing the batch size, and using pin_memory=False and shuffle=False, but none of it worked. I have 256 GB of RAM and 4 NVIDIA Tesla V100 GPUs.
When I train with each dataloader individually instead of both simultaneously, it works. However, my project requires this parallel training with the 2 datasets.

Based on this discussion, instead of cycle() and zip() I now avoid the error by using:
# inside the training loop over the longer dataloader:
try:
    data, target = next(dataloader_iterator)
except StopIteration:
    # `dataloader` is exhausted; restart its iterator from the beginning
    dataloader_iterator = iter(dataloader)
    data, target = next(dataloader_iterator)
Kudos to srossi93 from this PyTorch forum post!

Related

How to fix memory allocation issues when converting annotated NLP model to dataframe in R

I am trying to convert an annotated NLP model of size 1.2 GB to a dataframe. I am using the udpipe package for natural language processing in R with the following code:
# Additional Topic Models
# annotate and tokenize corpus
model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(model$file_model)
s <- udpipe_annotate(udmodel_english, cleaned_text_NLP)
options(java.parameters = "-Xmx32720m")
memory.limit(3210241024*1024)
x <- data.frame(s)
Note that I have 32 GB of RAM and allocated all available memory to R to run the code. I also tried deleting large objects stored in the R environment that are not relevant to running the code above. R cannot seem to allocate enough memory for the task, and the following error message was the result:
Error in strsplit(x$conllu, "\n") :
could not allocate memory (4095 Mb) in C function 'R_AllocStringBuffer'
My question is twofold:
What does the above error message mean?
What workarounds are available to fix this issue?
You probably have quite a few documents to annotate. It's better to annotate in chunks, as shown at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-parallel.html
The following code annotates in chunks of 50 documents, in parallel across 2 cores, and essentially performs your data.frame command. You will no longer have the issue, because strsplit is applied to each chunk of 50 documents instead of to your full dataset, where the annotated text was apparently too large to fit within the limits of a string buffer in R. The code below will solve your issue.
x <- udpipe(cleaned_text_NLP, udmodel_english, parallel.cores = 2L, parallel.chunksize = 50)
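If you prefer to keep calling udpipe_annotate directly rather than the udpipe() wrapper, a rough manual-chunking sketch along the same lines (assuming cleaned_text_NLP is a character vector of documents, as in the question) could look like this:
# annotate in chunks of 50 documents and bind the per-chunk data frames
chunk_id <- ceiling(seq_along(cleaned_text_NLP) / 50)
chunks   <- split(cleaned_text_NLP, chunk_id)
ids      <- split(paste0("doc", seq_along(cleaned_text_NLP)), chunk_id)
x <- do.call(rbind, Map(function(txt, id) {
  as.data.frame(udpipe_annotate(udmodel_english, x = txt, doc_id = id))
}, chunks, ids))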

Error in { : task 1 failed - "error returned from C call" using ncvar_get (ncdf4 package) within foreach loop

I am trying to extract data from a .nc file. Since there are 7 variables in my file, I want to loop the ncvar_get function through all 7 using foreach.
Here is my code:
# EXTRACTING CLIMATE DATA FROM NETCDF4 FILE
library(dplyr)
library(data.table)
library(lubridate)
library(ncdf4)
library(parallel)
library(foreach)
library(doParallel)
# SET WORKING DIRECTORY
setwd('/storage/hpc/data/htnb4d/RIPS/UW_climate_data/')
# SETTING UP
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
# READING INPUT FILE
infile <- nc_open("force_SERC_8th.1979_2016.nc")
vars <- attributes(infile$var)$names
climvars <- vars[1:7]
# EXTRACTING INFORMATION OF STUDY DOMAIN:
tab <- read.csv('SDGridArea.csv', header = T)
point <- sort(unique(tab$PointID)) #6013 points in the study area
# EXTRACTING DATA (P, TMAX, TMIN, LW, SW AND RH):
clusterEvalQ(cl, {
library(ncdf4)
})
clusterExport(cl, c('infile','climvars','point'))
foreach(i = climvars) %dopar% {
climvar <- ncvar_get(infile, varid = i) # all data points 13650 points
dim <- dim(climvar)
climMX <- aperm(climvar,c(3,2,1))
dim(climMX) <- c(dim[3],dim[1]*dim[2])
climdt <- data.frame(climMX[,point]) #getting 6013 points in the study area
write.table(climdt,paste0('SD',i,'daily.csv'), sep = ',', row.names = F)
}
stopCluster(cl)
And the error is:
Error in { : task 1 failed - "error returned from C call"
Calls: %dopar% -> <Anonymous>
Execution halted
Could you please explain what is wrong with this code? I assume it has something to do with the cluster not being able to work out which variable to get from the file, since 'error returned from C call' usually comes from the varid argument of ncvar_get.
I had the same problem (identical error message) running a similar R script on my MacBook Pro (OSX 10.12.5). The problem seems to be that the different workers from the foreach loop try to access the same .nc file at the same time with ncvar_get. This can be solved by using ncvar_get outside the foreach loop (storing all the data in a big array) and accessing that array from within the foreach loop.
Obviously, another solution would be to split up the .nc file appropriately beforehand and then access the different .nc files from within the foreach loop. This should lower memory consumption, since copying the big array to each worker is avoided.
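A minimal sketch of the first approach (reading each variable once outside the loop), reusing the objects from the question (infile, climvars, point, cl):
# read every variable on the master process, then let the workers operate
# on plain in-memory arrays instead of the shared netCDF handle
climdata <- lapply(climvars, function(v) ncvar_get(infile, varid = v))
names(climdata) <- climvars
nc_close(infile)
clusterExport(cl, c('climdata', 'point'))
foreach(i = climvars) %dopar% {
  climvar <- climdata[[i]]
  d <- dim(climvar)
  climMX <- aperm(climvar, c(3, 2, 1))
  dim(climMX) <- c(d[3], d[1] * d[2])
  climdt <- data.frame(climMX[, point])
  write.table(climdt, paste0('SD', i, 'daily.csv'), sep = ',', row.names = F)
}
Note that clusterExport copies the full climdata list to every worker, which is exactly the memory cost that the file-splitting alternative above avoids.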
I had the same issue on a recently acquired work machine. However, the same code runs fine on my home server.
The difference is that on my server I built the netCDF libraries with parallel access enabled (which requires HDF5 compiled with an MPI compiler).
I suspect this feature can prevent the OP's error from happening.
EDIT:
In order to have NetCDF with parallel I/O, you first need to build HDF5 with the following arguments:
./configure --prefix=/opt/software CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx FC=/usr/bin/mpifort
And then, when building the NetCDF C and Fortran libraries, you can also enable the parallel I/O tests to make sure everything works fine:
./configure --prefix=/opt/software --enable-parallel-tests CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx (C version)
./configure --prefix=/opt/software --enable-parallel-tests CC=/usr/bin/mpicc FC=/usr/bin/mpifort F77=/usr/bin/mpifort (Fortran version)
Of course, in order to do that you need to have some kind of MPI library (MPICH, OpenMPI) installed on your computer.

Predict memory usage in R

I have downloaded a huge file (~300 MB) from the UCI Machine Learning dataset library.
Is there a way to predict the memory required to load the dataset before loading it into R?
I Googled a lot, but all I could find was how to measure memory with the R profiler and several other packages, and only after the objects are loaded into R.
based on "R programming" coursera course, U can calculate the proximate memory usage using number of rows and columns within the data" U can get that info from the codebox/meta file"
memory required = no. of column * no. of rows * 8 bytes/numeric
so for example if you have 1,500,00 rows and 120 column you will need more than 1.34 GB of spare memory required
U also can apply the same approach on other types of data with attention to number of bytes used to store different data types.
If your data's stored in a csv file, you could first read in a subset of the file and calculate the memory usage in bytes with the object.size function. Then, you could compute the total number of lines in the file with the wc command-line utility and use the line count to scale the memory usage of your subset to get an estimate of the total usage:
top.size <- object.size(read.csv("simulations.csv", nrows=1000))
lines <- as.numeric(gsub("[^0-9]", "", system("wc -l simulations.csv", intern=T)))
size.estimate <- lines / 1000 * top.size
Presumably there's some object overhead, so I would expect size.estimate to be an overestimate of the total memory usage when you load the whole csv file; this effect will be diminished if you use more lines to compute top.size. Of course, this approach could be inaccurate if the first 1000 lines of your file are not representative of the overall file contents.
R has the function object.size(), which provides an estimate of the memory being used to store an R object.
You can use it like this:
predict_data_size <- function(numeric_size, number_type = "numeric") {
  if (number_type == "integer") {
    byte_per_number = 4
  } else if (number_type == "numeric") {
    byte_per_number = 8  # 8 bytes per number
  } else {
    stop(sprintf("Unknown number_type: %s", number_type))
  }
  estimate_size_in_bytes = (numeric_size * byte_per_number)
  class(estimate_size_in_bytes) = "object_size"
  print(estimate_size_in_bytes, units = "auto")
}
# Example
# Matrix (rows=2000000, cols=100)
predict_data_size(2000000*100, "numeric") # 1.5 Gb

How do I save all the draws from an MCMC posterior distribution to a file in R

I'm running a hierarchical linear regression model using the bayesm package in R. I have a dataset with one dependent variable and 6 predictors. There are 207 unique respondents, with 35 observations for each.
I began by using
print(out$betadraw)
Then I read about the sink function to write out$betadraw to a file. I thought that sink would capture all the draws; instead, the output was truncated after a certain number of draws.
I need to capture all the draws. In addition, is it possible to pass objects from the bayesm package to the coda package for convergence diagnostics? Any help would be greatly appreciated.
Without a reproducible example, it's hard to know for sure what is going on.
You should be able to open a text connection using ?file with the open argument set to write. Then, you can capture output and write it to your file using ?write with the append argument set to TRUE. The following worked fine on my machine:
> zz <- file(description="some name.txt", open="w")
> isOpen(zz)
[1] TRUE
> for(i in 1:100000){
+ x <- rbeta(1000, shape1=10, shape2=10)
+ write(x, file=zz, append=TRUE)
+ }
> close(zz)
(Note, I wouldn't try running that; it took nearly half an hour and created a 962 MB file that could only be opened with EditPad.)
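As an alternative to writing text output, you could also save the draws object itself in a binary format with saveRDS and convert slices to coda objects for diagnostics. A rough sketch, assuming out$betadraw is the usual bayesm 3-dimensional array (respondents x coefficients x draws); the file name is just a placeholder:
library(coda)
# keep every draw, losslessly, in one compact binary file
saveRDS(out$betadraw, file = "betadraw.rds")
# later: reload and run convergence diagnostics on one respondent's coefficients
betadraw <- readRDS("betadraw.rds")
resp1 <- mcmc(t(betadraw[1, , ]))  # iterations in rows, coefficients in columns
summary(resp1)
plot(resp1)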

load a large text file to a list in R

So I have a text file that contains data in the right format to be a list in R, but it is 14 MB and apparently 2 MB is a limitation? I need to load this text file into R as a list.
There is another post here, but that command (see below) just errors out:
inlist <- strsplit(readLines("myList.txt"), "[[:space:]]+")
thanks
Since the file is huge, here is how it starts, so you can see what it looks like:
structure(list(inputsTrain = structure(c(-73, -69, -48, 13, -86, -147, -65, -71, -32, 100, -73, -196, -102, 37, 14, 55, ........
It appears that your data is the result of dput(mylist, file = 'mylist.txt')
I would suggest using the inverse of dput, which is dget:
inlist <- dget('mylist.txt')
which is simply a wrapper for
eval(parse(file = 'mylist.txt'))
I've tested this on a 9 MB file without error or warning.
For example
dput(as.list(seq_len(1e6)), 'foo')
# foo is a 9.3 megabyte file
x <- dget('foo')
# works nicely
In future, don't save R objects as ASCII representations; instead, use saveRDS to save a serialized version, which can be read back with readRDS.
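For example (the object and file names here are just placeholders):
mylist <- as.list(seq_len(1e6))
saveRDS(mylist, file = "mylist.rds")   # compact binary serialization
mylist2 <- readRDS("mylist.rds")       # reads back an identical object
identical(mylist, mylist2)             # TRUE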
