Missing results from foreach and doParallel in R

Due to memory constraints in a previous script, I modified it following the advice given in a similar issue to mine (do not give workers more data than they need -
reading global variables using foreach in R). Unfortunately, I'm now struggling with missing results.
The script iterates over a matrix with 1.9M columns, processes each column and returns a one-row data frame (the rbind combine function in foreach stacks the rows). However, when it prints the results there are fewer rows (results) than columns, and the count changes on every run. Seemingly there is no error in the function inside the foreach loop, as it ran smoothly in the previous script, and no error or warning message pops up.
New Script:
if(!require(R.utils)) { install.packages("R.utils"); require(R.utils)}
if(!require(foreach)) { install.packages("foreach"); require(foreach)}
if(!require(doParallel)) { install.packages("doParallel"); require(doParallel)}
if(!require(data.table)) { install.packages("data.table"); require(data.table)}
registerDoParallel(cores=6)
out.file = "data.result.167_6_inside.out"
out.file2 = "data.result.167_6_outside.out"
data1 = fread("data.txt",sep = "auto", header=FALSE, stringsAsFactors=FALSE,na.strings = "NA")
data2 = transpose(data1)
rm(data1)
data3 = data2[,3: dim(data2)[2]]
levels2 = data2[-1,1:(3-1)]
rm(data2)
colClasses=c(ID="character",Col1="character",Col2="character",Col3="character",Col4="character",Col5="character",Col6="character")
res_table = dataFrame(colClasses,nrow=0)
write.table(res_table , file=out.file, append = T, col.names=TRUE, row.names=FALSE, quote=FALSE)
write.table(res_table, file=out.file2, append = T, col.names=TRUE, row.names=FALSE, quote=FALSE)
tableRes = foreach(col1=data3, .combine="rbind") %dopar% {
id1 = col1[1]
df2function = data.frame(levels2[,1,drop=F],levels2[,2,drop=F],as.numeric(col1[-1]))
mode(df2function[,1])="numeric"
mode(df2function[,2])="numeric"
values1 <- try(genericFunction(df2function), TRUE)
if (is.numeric(try (values1$F, TRUE)))
{
res_table [1,1] = id1
res_table [1,2] = values1$F[1,1]
res_table [1,3] = values1$F[1,2]
res_table [1,4] = values1$F[1,3]
res_table [1,5] = values1$F[2,2]
res_table [1,6] = values1$F[2,3]
res_table [1,7] = values1$F[3,3]
} else
{
res_table[1,1] = id1
res_table[1,2] = NA
res_table[1,3] = NA
res_table[1,4] = NA
res_table[1,5] = NA
res_table[1,6] = NA
res_table[1,7] = NA
}
write.table(res_table, file=out.file, append = T, col.names=FALSE, row.names=FALSE, quote=FALSE)
return(res_table[1,])
}
write.table(tableRes, file=out.file2, append = T, col.names=FALSE, row.names=FALSE, quote=FALSE)
In the previous script, the foreach call was written this way:
tableRes = foreach(i=1:length(data3), iter=icount(), .combine="rbind") %dopar% { (same code as above) }
Thus, I would like to know what the possible causes of this behaviour could be.
I'm running this script on a cluster, requesting 80 GB of memory (and 6 cores in this example). This is the largest amount of RAM I can request on a single node, to be sure that the script will not fail due to lack of memory. (Each node has a pair of 14-core Intel Xeon Skylake CPUs at 2.6 GHz and 128 GB of RAM; OS: RHEL 7.)
Ps 1: Although the new script no longer pages (even with more than 8 cores), it seems that each child process still loads a large amount of data into memory (~6 GB), as I tracked with the top command.
Ps 2: The new script prints the results both inside and outside the foreach loop, to track whether the loss of data occurs during the loop or after it finishes; every run gives me a different number of printed results inside and outside the loop.
Ps 3: The fastest run was with 20 cores (6 s for 1000 iterations) and the slowest was 56 s on a single core (tests performed with microbenchmark using 10 replications). However, more cores leads to fewer results being returned for the full matrix (1.9M columns).
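To make the check described in Ps 2 concrete, this is a minimal sketch of the comparison run after the loop (using the objects and file names defined in the script above):
# compare how many results survive at each stage
nrow(tableRes)                    # rows returned by the rbind combine
length(readLines(out.file)) - 1   # rows appended inside the loop (minus the header)
length(readLines(out.file2)) - 1  # rows written after the loop (minus the header)
ncol(data3)                       # expected number of results (one per column)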
I really appreciate any help you can provide,

Related

how to write out multiple files in R?

I am a newbie R user. Now, I have a question related to writing out multiple files with different names. Let's say that my data has the following structure:
IV_HAR_m1 <- matrix(rnorm(2000 * 30), ncol = 30, nrow = 2000) # one random value per cell
DV_HAR_m1 <- matrix(rnorm(2000 * 10), ncol = 10, nrow = 2000)
I am trying to estimate multiple LASSO regressions. At the beginning, I stored the iterations in one object called Dinamic_beta. This object was stored in a single file, and it saved the required information each time my code iterated.
For doing this I was using stew, which belongs to the pomp package, but the whole process takes 5 or 6 days and I am worried about a power outage or a failure of my computer.
Now I want to save each environment (iteration) in its own .Rnd file. I do not know how to do that; the code I am using is the following:
library(glmnet)
library(Matrix)
library(pomp)
space <- 7 # the number of files that I would want to create
Dinamic_betas <- array(NA, c(10, 31, nrow(IV_HAR_m1) - space))
dimnames(Dinamic_betas) <- list(NULL, NULL, NULL) # one entry per dimension of the 3-d array
set.seed(12345)
stew( # stew saves the environment in a .Rnd file
file = "Dinamic_LASSO_RD", { # the name required by stew for creating one file with all the information
for (i in 1:dim(Dinamic_betas)[3]) {
tryCatch( # print messages
expr = {
cv_dinamic <- cv.glmnet(IV_HAR_m1[i:(space+i-1),],
DV_HAR_m1[i:(space+i-1),], alpha = 1, family = "mgaussian", thresh=1e-08, maxit=10^9)
LASSO_estimation_dinamic<- glmnet(IV_HAR_m1[i:(space+i-1),], DV_HAR_m1[i:(space+i-1),],
alpha = 1, lambda = cv_dinamic$lambda.min, family = "mgaussian")
coefs <- as.matrix(do.call(cbind, coef(LASSO_estimation_dinamic)))
Dinamic_betas[,,i] <- t(coefs)
},
error = function(e){
message("Caught an error!")
print(e)
},
warning = function(w){
message("Caught an warning!")
print(w)
},
finally = {
message("All done, quitting.")
}
)
if (i%%400==0) {print(i)}
}
}
)
If someone can suggest another package that stores the outputs in different files, I will be grateful.
Try adding this just before the close of your loop
save.image(paste0("Results_iteration_",i,".RData"))
This should save your entire workspace to disk on every iteration. You can then use load() to restore the workspace from any iteration. Let me know if this works.
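For example, to pick up from a saved iteration later on (the iteration number 800 is just an illustration):
# restore the workspace saved at iteration 800 into the current session
load("Results_iteration_800.RData")
Keep in mind that save.image() writes the whole workspace on every iteration, which can get slow once the objects are large; saving only the object that changes, e.g. saveRDS(Dinamic_betas, paste0("betas_", i, ".rds")), is a lighter-weight alternative.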

How to continuously send data from LabVIEW to R? (code help)

I am trying to bring real-time data from LabVIEW (vibration of a bearing and temperature) into an app written in R to create a control chart. It works for a while but eventually crashes with the following error message:
Error in aggregate.data.frame(B, list(rep(1:(nrow(B)%/%n + 1), each = n, :
no rows to aggregate
The process works as follows: LabVIEW takes the data and writes it to two CSV files; those files are read by the R code and used to draw a control chart. The process succeeds for some time, and the failure point is not always the same: sometimes the control chart will run for 6-7 minutes, other times it will crash within 2 minutes.
My suspicion is that the files are not being updated fast enough, so the R code sometimes tries to read a file while it is empty.
Any suggestions would be great! thank you!
I have tried to lower the sample size taken per second. That did not work.
getwd()
setwd("C:/Users/johnd/Desktop/R Data")
while(1) {
A = fread("C:/Users/johnd/Desktop/R Data/a1.csv" , skip = 4 , header = FALSE , col.names = c("t1","B2","t2","AM","t3","M","t4","B1"))
t1 = A$t1
B2 = A$B2
t2 = A$t2
AM = A$AM
t3 = A$t3
M = A$M
t4 = A$t4
B1 = A$B1
B = fread("C:/Users/johnd/Desktop/R Data/b1.csv" , skip = 4 , header = FALSE , col.names = c("T1","small","T2","big"))
T1 = B$T1
small = B$small
T2 = B$T2
big = B$big
DJ1 = A[seq(1,nrow(A),1),c('t1','B2','AM','M','B1')]
DJ1
n = 16
DJ2 = aggregate(B,list(rep(1:(nrow(B)%/%n+1),each=n,len=nrow(B))),mean)[-1]
DJ2
#------------------------------------------------------------------------
DJ6 = cbind(DJ1[,'B1'],DJ2[,c('small','big')]) # creates matrix for these three indicators
DJ6
#--------------T2 Hand made---------------------------------------------------------------------
new_B1 = DJ6[,'B1']
new_small = DJ6[,'small'] ### decompose the DJ6 matrix into vectors for each indicator(temperature, big & small accelerometers)
new_big = DJ6[,'big']
new_B1
new_small
new_big
mean_B1 = as.numeric(colMeans(DJ6[,'B1']))
mean_small = as.numeric(colMeans(DJ6[,'small'])) ##decomposes into vectors of type numeric
mean_big = as.numeric(colMeans(DJ6[,'big']))
cov_inv = data.matrix(solve(cov(DJ6))) # obtain inverse covariance matrix
cov_inv
p = ncol(DJ6) # number of quality characteristics, taken from the number of columns of the original matrix (previously hard-coded as p = 3)
m = 64 # number of samples (10 seconds of data)
a_alpha = 0.99
f= qf(a_alpha , df1 = p,df2 = (m-p)) ### calculates the F-Statistic for our data
f
UCL = (p*(m+1)*(m-1)*(f))/(m*(m-p)) ###produces upper control limit
UCL
diff_B1 = new_B1-mean_B1
diff_small = new_small-mean_small
diff_big = new_big-mean_big
DJ7 = cbind(diff_B1, diff_small , diff_big) #produces matrix of difference between average and observations (x-(x-bar))
DJ7
# DJ8 = data.matrix(DJ7[1,])
# DJ8
DJ9 = data.matrix(DJ7) ### turns matrix into appropriate numeric form
DJ9
# T2.1.1 = DJ8 %*% cov_inv %*% t(DJ8)
# T2.1.1
# T2.1 = t(as.matrix(DJ9[1,])) %*% cov_inv %*% as.matrix(DJ9[1,])
# T2.1
#T2 <- NULL
for(i in 1:64){ #### creates vector of T^2 statistic
T2<- t(as.matrix(DJ9[i,])) %*% cov_inv %*% as.matrix(DJ9[i,]) # calculation of T^2 test statistic ## there is no calculation of x-double bar
write.table(T2,"C:/Users/johnd/Desktop/R Data/c1.csv",append=T,sep="," , col.names = FALSE)#
#
DJ12 <-fread("C:/Users/johnd/Desktop/R Data/c1.csv" , header = FALSE ) #
}
# DJ12
DJ12$V1 = 1:nrow(DJ12)
# plot(DJ12 , type='l')
p1 = nrow(DJ12)-m
p2 = nrow(DJ12)
plot(DJ12[p1:p2,], type ='o', ylim =c(0,15), ylab="T2 Chart" , xlab="Data points") ### plots last 640 points
# plot(DJ12[p1:p2,], type ='o' , ylim =c(0,15) , ylab="T2 Chart" , xlab="Data points")
abline(h=UCL , col="red") ## displays upper control limit
Sys.sleep(1)
}
Your suspicion is correct: the R code sometimes reads a file before LabVIEW has finished writing it.
With your current design, your R application can crash depending on how fast it runs relative to your LabVIEW application. This is called a race condition; you must eliminate race conditions from your code.
A quick and dirty solution
One simple solution to avoid the crash is to call NROW to check if any data exists. If there's no data available, don't call aggregate. This is described here: error message in r: no rows to aggregate
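A minimal sketch of that guard, applied to the read of b1.csv inside the while loop above (the one-second wait is just an illustration):
B = fread("C:/Users/johnd/Desktop/R Data/b1.csv", skip = 4, header = FALSE,
          col.names = c("T1","small","T2","big"))
if (NROW(B) == 0) {   # LabVIEW has not (re)written the file yet
  Sys.sleep(1)        # wait and retry on the next pass of the loop
  next
}
DJ2 = aggregate(B, list(rep(1:(nrow(B)%/%n + 1), each = n, len = nrow(B))), mean)[-1]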
A more robust solution
A better solution is to use a communications protocol like TCP to stream data from LabVIEW to R, instead of using CSV files to transfer real-time data. For example, your R program could listen for data on a TCP socket. Make it wait for data to be sent from LabVIEW before running your data processing code.
Here is an example on using socketConnection in R: http://blog.corynissen.com/2013/05/using-r-to-communicate-via-socket.html
Here is an example on sending/receiving data over TCP in LabVIEW: http://www.ni.com/product-documentation/2710/en/
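For reference, a bare-bones sketch of the R side of such a setup, along the lines of the socketConnection example in the first link (the port number and the comma-separated message format are assumptions; the LabVIEW VI must be configured to match):
# open a server socket and wait for LabVIEW to connect and send readings
con <- socketConnection(host = "localhost", port = 6011, server = TRUE,
                        blocking = TRUE, open = "r+")
repeat {
  line <- readLines(con, n = 1)   # one line per sample, e.g. "12.3,0.41,0.07"
  if (length(line) == 0) next     # nothing received yet, keep waiting
  vals <- as.numeric(strsplit(line, ",")[[1]])
  # ... update the control chart with vals here, and break when done ...
}
close(con)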

Read in large text file in chunks

I'm working with limited RAM (AWS free-tier EC2 server, 1 GB).
I have a relatively large txt file, "vectors.txt" (800 MB), that I'm trying to read into R. Having tried various methods, I have failed to read it into memory.
So I was researching ways of reading it in chunks. I know that the dimensions of the resulting data frame should be 300K x 300. If I could read the file in, say, 10K lines at a time and save each chunk as an RDS file, I would be able to loop over the results and get what I need, albeit a little more slowly and with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces, reading in each piece, saving each as a data frame and then as an RDS file? Or any other alternatives?
EDIT:
From Jonathan's answer below, I tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
every_nlines,
table_name,
dbname = sub("\\.txt$", ".sqlite", tsv),
...) {
# Prepare reading
con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
init <- TRUE
fill_sqlite <- function(df) {
if (init) {
RSQLite::dbCreateTable(con, table_name, df)
init <<- FALSE
}
RSQLite::dbAppendTable(con, table_name, df)
NULL
}
# Read and fill by parts
bigreadr::big_fread1(tsv, every_nlines,
.transform = fill_sqlite,
.combine = unlist,
... = ...)
# Returns
con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html
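Once the data is in SQLite (for example via the csv2sqlite() helper in the EDIT above, which writes vector.sqlite with a table called "vectors"), a minimal sketch of querying it lazily with dplyr could look like this (the dbplyr backend must be installed):
library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "vector.sqlite")
vectors <- tbl(con, "vectors")        # lazy table reference: nothing is loaded into RAM yet
vectors %>% head(10) %>% collect()    # pull only a small chunk into memory
DBI::dbDisconnect(con)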

mclapply: all scheduled cores encountered errors in user code

The following is my code. I am trying to get the list of all the files (~20000) that end with .idat and read each file using the function illuminaio::readIDAT.
library(illuminaio)
library(parallel)
library(data.table)
# number of cores to use
ncores = 8
# this gets all the files with .idat extension ~20000 files
files <- list.files(path = './',
pattern = "*.idat",
full.names = TRUE)
# function to read the idat file and create a data.table of filename, and two more columns
# write out as csv using fwrite
get.chiptype <- function(x)
{
idat <- readIDAT(x)
res <- data.table(filename = x, nSNPs = nrow(idat$Quants), Chip = idat$ChipType)
fwrite(res, file.path = 'output.csv', append = TRUE)
}
# using mclapply call the function get.chiptype on all 20000 files.
# use 8 cores at a time
mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores)
After reading and writing info about 1200 files, I get the following message:
Warning message:
In mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores) :
all scheduled cores encountered errors in user code
How do I resolve it?
Calling mclapply() in some instances requires you to specify a random number generator that allows for multiple streams of random numbers.
R version 2.14.0 has an implementation of Pierre L'Ecuyer's multiple pseudo-random number generator.
Try adding the following before the mclapply() call, with a pre-specified value for 'my.seed':
set.seed(my.seed, kind = "L'Ecuyer-CMRG")
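Put together with the code in the question, that looks something like this (my.seed is a placeholder value):
my.seed <- 123                              # placeholder seed
set.seed(my.seed, kind = "L'Ecuyer-CMRG")   # independent RNG streams for the child processes
results <- mclapply(files, get.chiptype, mc.cores = ncores)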

Reading in all files with a specific extension

I have several CSV files stored in the folder "C://Users//Prices//". I want to read these files into R and store them as a data frame. I tried a for loop; however, this would take hours to read in all the files (I measured it with system.time()).
Can this be done without using a for loop?
I will reiterate that fread is significantly quicker, as is shown in this post on Stack Overflow: Quickly reading very large tables as dataframes in R. In summary, the tests (on a 51 MB file, 1e6 rows x 6 columns) showed a performance improvement of over 70% against the best alternative methods, including sqldf, ff and read.table with and without the optimised settings recommended in the answer by lukeA. This was backed up in the comments, which report a 4 GB file loading in under a minute with fread, compared to 15 hours with base functions.
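Applied directly to the folder in the question, the fread-plus-rbindlist pattern that the benchmarks below support would look something like this (a sketch that assumes all the CSVs share the same columns):
library(data.table)

# read every CSV in the folder with fread and stack the results into one table
files  <- list.files("C://Users//Prices//", pattern = "\\.csv$", full.names = TRUE)
prices <- rbindlist(lapply(files, fread))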
I ran some tests of my own to compare alternative methods of reading and combining CSV files. The experimental setup was as follows:
Generate a 4-column CSV file (1 character column, 3 numeric) for each run. There are 6 runs, each with a different number of rows, ranging over 10^1, 10^2, ..., 10^6 records in the data file.
Import the CSV file into R 10 times, joining with rbind or rbindlist to create a single table.
Test read.csv and read.table, with and without optimised arguments such as colClasses, against fread.
Using microbenchmark, repeat every test 10 times (probably unnecessarily many!) and collect the timings for each run.
The results again come out in favour of fread with rbindlist over optimised read.table with rbind.
This table shows the median total duration for 10 file reads and combines for each method and number of rows per file. The first 3 columns are in milliseconds, the last 3 in seconds.
expr 10 100 1000 10000 1e+05 1e+06
1: FREAD 3.93704 5.229699 16.80106 0.1470289 1.324394 12.28122
2: READ.CSV 12.38413 18.887334 78.68367 0.9609491 8.820387 187.89306
3: READ.CSV.PLUS 10.24376 14.480308 60.55098 0.6985101 5.728035 51.83903
4: READ.TABLE 12.82230 21.019998 74.49074 0.8096604 9.420266 123.53155
5: READ.TABLE.PLUS 10.12752 15.622499 57.53279 0.7150357 5.715737 52.91683
[Plot omitted: comparison of the timings across methods when run 10 times on the HPC.]
Normalising these values against the fread timing shows how much longer these other methods take for all scenarios:
10 100 1000 10000 1e+05 1e+06
FREAD 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
READ.CSV 3.145543 3.611553 4.683256 6.535784 6.659941 15.299223
READ.CSV.PLUS 2.601893 2.768861 3.603998 4.750835 4.325023 4.221001
READ.TABLE 3.256838 4.019352 4.433693 5.506811 7.112887 10.058576
READ.TABLE.PLUS 2.572370 2.987266 3.424355 4.863232 4.315737 4.308762
Table of results for 10 microbenchmark iterations on the HPC
Interestingly, for 1 million rows per file, the optimised versions of read.csv and read.table take roughly 4.2 and 4.3 times as long as fread, while without optimisation this leaps to roughly 15 and 10 times as long.
Note that when I conducted this experiment on my powerful laptop rather than the HPC cluster, the performance gains were somewhat smaller (the optimised readers were around 80% slower than fread, as opposed to taking over four times as long on the HPC). This is interesting in itself; I'm not sure I can explain it, however!
10 100 1000 10000 1e+05 1e+06
FREAD 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
READ.CSV 2.595057 2.166448 2.115312 3.042585 3.179500 6.694197
READ.CSV.PLUS 2.238316 1.846175 1.659942 2.361703 2.055851 1.805456
READ.TABLE 2.191753 2.819338 5.116871 7.593756 9.156118 13.550412
READ.TABLE.PLUS 2.275799 1.848747 1.827298 2.313686 1.948887 1.832518
Table of results for only 5 `microbenchmark` iterations on my i7 laptop
Given that the data volume is reasonably large I'd suggest the benefits will not only be in the reading of the files with fread but in the subsequent manipulation of the data with the data.table package as opposed to traditional data.frame operations! I am lucky to have learned this lesson at an early stage and would recommend others to follow suit...
Here is the code used in the tests.
rm(list=ls()) ; gc()
library(data.table) ; library(microbenchmark)
#=============== FUNCTIONS TO BE TESTED ===============
f_FREAD = function(NUM_READS) {
for (i in 1:NUM_READS) {
if (i == 1) x = fread("file.csv") else x = rbindlist(list(x, fread("file.csv")))
}
}
f_READ.TABLE = function(NUM_READS) {
for (i in 1:NUM_READS) {
if (i == 1) x = read.table("file.csv") else x = rbind(x, read.table("file.csv"))
}
}
f_READ.TABLE.PLUS = function (NUM_READS) {
for (i in 1:NUM_READS) {
if (i == 1) {
x = read.table("file.csv", sep = ",", header = TRUE, comment.char="", colClasses = c("character", "numeric", "numeric", "numeric"))
} else {
x = rbind(x, read.table("file.csv", sep = ",", header = TRUE, comment.char="", colClasses = c("character", "numeric", "numeric", "numeric")))
}
}
}
f_READ.CSV = function(NUM_READS) {
for (i in 1:NUM_READS) {
if (i == 1) x = read.csv("file.csv") else x = rbind(x, read.csv("file.csv"))
}
}
f_READ.CSV.PLUS = function (NUM_READS) {
for (i in 1:NUM_READS) {
if (i == 1) {
x = read.csv("file.csv", header = TRUE, colClasses = c("character", "numeric", "numeric", "numeric"))
} else {
x = rbind(x, read.csv("file.csv", comment.char="", header = TRUE, colClasses = c("character", "numeric", "numeric", "numeric")))
}
}
}
#=============== MAIN EXPERIMENTAL LOOP ===============
for (i in 1:6)
{
NUM_ROWS = 10^i      # the loop allows us to test the performance over varying numbers of rows
NUM_READS = 10       # number of file reads & combines per method
NUM_ITERATIONS = 10  # number of microbenchmark repetitions per method
# create a test data.table with the specified number of rows and write it to file
dt = data.table(
col1 = sample(letters[],NUM_ROWS,replace=TRUE),
col2 = rnorm(NUM_ROWS),
col3 = rnorm(NUM_ROWS),
col4 = rnorm(NUM_ROWS)
)
write.csv(dt, "file.csv", row.names=FALSE)
# run the imports for each method, recording results with microbenchmark
results = microbenchmark(
FREAD = f_FREAD(NUM_READS),
READ.TABLE = f_READ.TABLE(NUM_READS),
READ.TABLE.PLUS = f_READ.TABLE.PLUS(NUM_READS),
READ.CSV = f_READ.CSV(NUM_READS),
READ.CSV.PLUS = f_READ.CSV.PLUS(NUM_READS),
times = NUM_ITERATIONS)
results = data.table(NUM_ROWS = NUM_ROWS, results)
if (i == 1) results.all = results else results.all = rbindlist(list(results.all, results))
}
results.all[,time:=time/1000000000] # convert the timings from nanoseconds to seconds
Speed up your read.table command by
predefining colClasses=c("numeric", "factor", ...)
setting stringsAsFactors=FALSE
disabling csv comments using comment.char=""
Via http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/
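For instance, an optimised call along those lines might look like this (the file name and column classes are placeholders for whatever the real files contain):
# placeholder file name and column classes; adjust to the actual data
prices1 <- read.table("C://Users//Prices//prices1.csv", sep = ",", header = TRUE,
                      colClasses = c("character", "numeric", "numeric", "numeric"),
                      stringsAsFactors = FALSE, comment.char = "")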
