My goal is to perform an operation on a data frame, as follows:
exp_info <- data.frame(location.Id = 1:1e7,
x = rnorm(10))
For each location, I want to square the x variable and write the result to its own csv file. My actual computation is longer and involves other steps, so this is a simplified example. This is how I am parallelising the task:
library(doParallel)
myClusters <- parallel::makeCluster(6)
doParallel::registerDoParallel(myClusters)
foreach(i = 1:nrow(exp_info),
.packages = c("dplyr","data.table"),
.errorhandling = 'remove',
.verbose = TRUE) %dopar%
{
rowRef <- exp_info[i, ]
rowRef <- rowRef %>% dplyr::mutate(x.sq = x^2)
fwrite(rowRef, paste0(i,'_iteration.csv'))
}
When I look at my working directory, all of the individual csv files (1e7 of them) have been written out, which suggests the code above ran successfully. However, the foreach loop never returns, even after all the files are written, and I have to kill the job, which does not produce any error either. Does anyone have any idea why this might happen?
I'm experiencing something similar. I don't know the answer, but will add this: the same code and operation works on one computer, but fails to exit the foreach loop on another. Hope this provides some direction.
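One thing that may be worth trying, offered as a guess rather than a confirmed diagnosis: foreach collects the return value of every iteration into a list, so 1e7 tiny tasks mean 1e7 results to send back and combine. A minimal sketch of grouping rows into chunks, so that each task handles many rows and returns nothing, is below. It uses a small stand-in data frame, base write.csv instead of fwrite, and a serial lapply; the names n_chunks, process_chunk and out_dir are hypothetical.

```r
# Small stand-in for exp_info (the original has 1e7 rows).
exp_info <- data.frame(location.Id = 1:20, x = rnorm(20))
out_dir <- tempdir()

# Split the row indices into a handful of chunks instead of one task per row.
n_chunks <- 4
row_ids <- seq_len(nrow(exp_info))
chunks <- split(row_ids, cut(row_ids, n_chunks, labels = FALSE))

process_chunk <- function(idx) {
  for (i in idx) {
    rowRef <- exp_info[i, ]
    rowRef$x.sq <- rowRef$x^2
    write.csv(rowRef, file.path(out_dir, paste0(i, "_iteration.csv")),
              row.names = FALSE)
  }
  invisible(NULL)  # nothing to send back to the caller
}

# Run serially here; with a registered backend, each chunk could instead be
# one %dopar% task (an assumption, not tested against the original setup).
invisible(lapply(chunks, process_chunk))
```

With this layout the number of tasks equals the number of chunks, not the number of rows. Calling parallel::stopCluster() on the cluster once the loop returns is also good practice.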
I have an R function that loads, processes, and saves many files. Here is a dummy version:
load_process_saveFiles <- function(onlyFiles = c()){
allFiles <- paste(LETTERS, '.csv', sep = '')
# If desired, only include certain files
if(length(onlyFiles) > 0){
allFiles <- allFiles[allFiles %in% onlyFiles]
}
for(file in allFiles){
# load file
rawFile <- file
# Run a super long function
processedFile <- rawFile
# Save file
# write.csv(processedFile, paste('./Other/Path/', file, sep = ''), row.names = FALSE)
cat('\nDone with file ', file, sep = '')
}
}
It has to run through about 30 files, and each one takes about 3 minutes. It can be very time consuming to loop through the entire thing. What I'd like to do is run each one separately at the same time so that it would take 3 minutes all together instead of 3 x 30 = 90 minutes.
I know I can achieve this by creating a bunch of RStudio sessions or many terminal tabs, but I can't handle having that many sessions or tabs open at once.
Ideally, I'd like to have all of the files with separate functions listed in one batchRun.R file which I can run from the terminal:
source('./PathToFunction/load_process_saveFiles.R')
load_process_saveFiles(onlyFiles = 'A.csv')
load_process_saveFiles(onlyFiles = 'B.csv')
load_process_saveFiles(onlyFiles = 'C.csv')
load_process_saveFiles(onlyFiles = 'D.csv')
load_process_saveFiles(onlyFiles = 'E.csv')
load_process_saveFiles(onlyFiles = 'F.csv')
I would then run $ Rscript batchRun.R from the terminal.
I've tried looking up different examples on SO that accomplish something similar, but each has some unique features and I just can't get any of them to work. Is what I'm trying to do possible? Thanks!
Package parallel gives you a number of options. One option is to parallelize the calls to load_process_saveFiles and have the loop inside of the function run serially. Another option is to parallelize the loop and have the calls run serially. The best way to assess which approach is more suitable for your job is to time them both yourself.
Evaluating the calls to load_process_saveFiles in parallel is relatively straightforward with mclapply, the parallel version of the base function lapply (see ?lapply):
parallel::mclapply(x, load_process_saveFiles, mc.cores = 2L)
Here, x is a list of values of the argument onlyFiles, and mc.cores = 2L indicates that you want to divide the calls among two R processes.
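As a minimal sketch of that call, the example below uses a dummy stand-in for load_process_saveFiles that only reports which files it would handle. The file groupings in x and the n_cores fallback are assumptions made for illustration.

```r
library(parallel)

# Dummy stand-in for the real function: it only returns the selected file names.
load_process_saveFiles <- function(onlyFiles = character(0L)) {
  allFiles <- paste0(LETTERS, ".csv")
  if (length(onlyFiles) > 0L) {
    allFiles <- allFiles[allFiles %in% onlyFiles]
  }
  allFiles
}

# Each list element is one value of onlyFiles, i.e. one call to the function.
x <- list("A.csv", "B.csv", c("C.csv", "D.csv"))

# mc.cores > 1 relies on forking, which is unavailable on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(x, load_process_saveFiles, mc.cores = n_cores)
str(res)
```

The result is a list with one element per call, in the same order as x.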
Evaluating the loop inside of load_process_saveFiles in parallel would involve replacing the entire for statement with something like
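f <- function(file) {
    cat("Processing file", file, "...")
    # Load the raw file
    x <- read.csv(file)
    # process() is a placeholder for your long-running step
    y <- process(x)
    # Save under the same file name in the output directory
    write.csv(y, file = file.path("path", "to", file), row.names = FALSE)
    cat(" done!\n")
}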
parallel::mclapply(allFiles, f, ...)
and redefining load_process_saveFiles to allow optional arguments:
load_process_saveFiles <- function(onlyFiles = character(0L), ...) {
## body
}
Then you could do, for example, load_process_saveFiles(onlyFiles, mc.cores = 2L).
I should point out that mclapply is not supported on Windows. On Windows, you can use parLapply instead, but there are some extra steps involved. These are described in the parallel vignette, which can be opened from R with vignette("parallel", "parallel"). The vignette acts as general introduction to parallelism in R, so it could be worth reading anyway.
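For reference, here is a rough sketch of the parLapply route, which also runs on Windows. It assumes a two-worker cluster and uses throwaway input files in tempdir(); the processing step (squaring a hypothetical value column) and the names in_dir and out_dir are placeholders.

```r
library(parallel)

# Worker function: read one file, apply a placeholder transformation, write it out.
# Functions are namespace-qualified so the workers need no extra setup.
f <- function(file, out_dir) {
  x <- utils::read.csv(file)
  x$value_sq <- x$value^2  # placeholder for the real processing
  out <- file.path(out_dir, basename(file))
  utils::write.csv(x, out, row.names = FALSE)
  out
}

# Create two small input files so the example is self-contained.
in_dir <- file.path(tempdir(), "in")
out_dir <- file.path(tempdir(), "out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
files <- file.path(in_dir, c("A.csv", "B.csv"))
for (p in files) write.csv(data.frame(value = 1:3), p, row.names = FALSE)

# The extra steps compared with mclapply: start a cluster, use it, stop it.
cl <- makeCluster(2L)
outputs <- parLapply(cl, files, f, out_dir = out_dir)
stopCluster(cl)
print(unlist(outputs))
```

If the worker function depended on packages or global objects, those would have to be loaded or exported on the workers first (for example with clusterEvalQ or clusterExport).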
The parallel package is useful in this case. If you are on Linux, I would recommend the doMC package instead of parallel. doMC is useful even for looping over the large datasets used in machine learning projects.
I'm trying to run a foreach loop as follows:
foreach(i=1:n, .combine=c, .packages=c("parallel", "doParallel", "pracma", "oce", "ineq", "gsw", "seewave", "soundecology", "data.table", "openxlsx", "tuneR", "vegan")) %dopar%
res[i,] <- indices(files[i])
The custom function indices() uses readWave() from the tuneR package to read wave files from a folder and loop through them. Each time I run this, I get the following error:
Error in readWave(x) : Object 'i' not found
The problem does not occur in a for loop. I've googled this but nobody seems to have had this one. Can anyone please help?
Thanks @Roland for pointing me in the right direction. Yes, I was using foreach in a conceptually wrong way, as if it worked like a for loop. I was able to get it to work by changing it like so:
palpha <- foreach(i = 1:n, .combine = "rbind", .packages = p) %dopar% indices(files[i])
I was later able to write the list obtained from foreach to my res data frame so:
res <- as.data.frame(palpha)
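To make the pattern concrete, here is a small sketch in which each iteration returns a value and .combine assembles the rows, with no assignment into res inside the loop. The indices() below is a hypothetical stand-in (the real one reads wave files), and %do% is used so that no parallel backend is needed. The same expression should work with %dopar% once a backend is registered.

```r
library(foreach)

# Hypothetical stand-in for indices(): one small named numeric vector per file.
indices <- function(file) {
  c(n_chars = nchar(file), is_wav = as.numeric(grepl("\\.wav$", file)))
}

files <- c("a.wav", "bb.wav", "ccc.txt")

# Each iteration returns one vector; rbind stacks them into a matrix.
palpha <- foreach(i = seq_along(files), .combine = "rbind") %do% indices(files[i])

res <- as.data.frame(palpha)
print(res)
```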
I have 3 csv files, namely file1.csv, file2.csv and file3.csv.
For each of the files, I would like to import the csv, apply some functions to it, and then export a transformed csv. So, 3 csv files in and 3 transformed csv files out. These are just 3 independent tasks, so I thought I could try foreach with %dopar%. Please note that I am using a Windows machine.
However, I cannot get this to work.
library(foreach)
library(doParallel)
library(xts)
library(zoo)
numCores <- detectCores()
cl <- parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
filenames <- c("file1.csv","file2.csv","file3.csv")
foreach(i = 1:3, .packages = c("xts","zoo")) %dopar%{
df_xts <- data_processing_IMPORT(filenames[i])
ddates <- unique(date(df_xts))
}
If I comment out the last line, ddates <- unique(date(df_xts)), the code runs fine with no error.
However, if I include that line, I get the error below, and I have no idea how to get around it. I tried adding .export = c("df_xts").
Error in { : task 1 failed - "unused argument (df_xts)"
It still doesn't work. I want to understand what's wrong with my logic and how I should get around this. I am only trying to apply simple functions to the data; I haven't yet transformed the data or exported it to separate csv files, and I am already stuck.
The funny thing is I have written the simple code below, which works fine. Within the foreach, a is just like the df_xts above, being stored in a variable and passed into Fun2 to process. And the code below works fine. But above doesn't. I don't understand why.
numCores <- detectCores()
cl <- parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
# Define the function
Fun1=function(x){
a=2*x
b=3*x
c=a+b
return(c)
}
Fun2=function(x){
a=2*x
b=3*x
c=a+b
return(c)
}
foreach(i = 1:10)%dopar%{
x <- rnorm(5)
a <- Fun1(x)
tst <- Fun2(a)
return(tst)
}
### Output: No error
parallel::stopCluster(cl)
Update: I have found out that the issue is with the date function there to extract the number of dates within the csv file but I am not sure how to get around this.
The use of foreach() is correct. The problem is the call to date() in ddates <- unique(date(df_xts)): base R's date() returns the current system date and time as a character string and takes no arguments, which is why you get the "unused argument" error.
So I guess you want to use as.Date() or something similar instead:
ddates <- unique(as.Date(df_xts))
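A quick base-R illustration of both points, as a sketch only: the stamps vector is made-up data standing in for the time index of the series. For an actual xts object the index would probably need to be extracted first (for example with zoo's index()), which is an assumption not exercised here.

```r
# date() takes no arguments, so passing one fails before anything else happens.
msg <- tryCatch(date(1), error = function(e) conditionMessage(e))
print(msg)

# Made-up timestamps standing in for the time index of the imported series.
stamps <- as.POSIXct(c("2020-01-01 09:00:00",
                       "2020-01-01 17:30:00",
                       "2020-01-02 09:00:00"), tz = "UTC")

# as.Date() drops the time of day; unique() then leaves one entry per calendar day.
ddates <- unique(as.Date(stamps, tz = "UTC"))
print(ddates)
```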
I've run into the same issue about reading, modifying and writing several CSV files. I tried to find a tidyverse solution for this, and while it doesn't really deal with the date problem above, here it is -- how to read, modify and write, several csv files using map from purrr.
library(tidyverse)
# There are some sample csv files in the "sample" dir.
# First get the paths of those.
datapath <- fs::dir_ls("./sample", regexp = ("csv"))
datapath
# Then read in the data, so that it is a list of data frames.
# It seems simpler to write them back to disk as separate files.
# Another way to read them would be:
# newsampledata <- vroom::vroom(datapath, ";", id = "path")
# but this will return a DF and separating it to different files
# may be more complicated.
sampledata <- map(datapath, ~ read_delim(.x, ";"))
# Do some transformation of the data.
# Here I just alter the column names.
transformeddata <- sampledata %>%
map(rename_all, tolower)
# Then prepare to write new files
names(transformeddata) <- paste0("new-", basename(names(transformeddata)))
# Write the csv files and check if they are there
map2(transformeddata, names(transformeddata), ~ write.csv(.x, file = .y))
dir(pattern = "new-")
I am trying to use the quasi-quotation syntax (quo, exprs, !!, etc.) as well as the foreach function to create several new variables by means of a named list of expressions to be evaluated inside the rxDataStep function, specifically, the transforms argument. I am getting the following error:
Error in rxLinkTransformComponents(transforms = transforms, transformFunc = transformFunc, : 'transforms' must be of the form list(...)
I have a dataset which includes a number of variables which I need to log-transform in order to perform further analyses. I have been using the rx functions from the "RevoScaleR" package for roughly three years and totally missed the "tidyverse"/pipeline style of data transformation. I do occasionally dabble with these tools but prefer to stick with the aforementioned rx functions, given my relative familiarity with them and the fact that they have served me very well thus far.
As a MWE:
Required libraries:
library(foreach)
library(rlang)
Creating variables which need to be log-transformed.
vars <- foreach(i = 10:20, .combine = "cbind") %do% rnorm(10, i)
Dataframe with identifier and above variables.
data_in <- data.frame(id = 1:10, vars)
Object which creates the expressions of the log-transformed variables; this creates a named list.
log_vars <- foreach(i = names(data_in[-1]), .final = function(x) set_names(x, paste0(names(data_in[-1]), "_log"))) %do%
expr(log10(!!sym(i)))
Now attempting to add the variables to the existing dataframe.
data_out <- rxDataStep(inData = data_in, transforms = log_vars, transformObjects = list(log_vars = log_vars))
The resulting error is the following:
Error in rxLinkTransformComponents(transforms = transforms, transformFunc = transformFunc, : 'transforms' must be of the form list(...)
I simply cannot understand the error given that log_vars is defined as a named list. One can check this with str and typeof.
I have tried a slightly different way of defining the new variables:
log_vars <- unlist(foreach(i = names(data_in[-1]), j = paste0(names(data_in[-1]), "_log")) %do%
exprs(!!j := log10(!!sym(i))))
I have to use unlist given that exprs delivers a list as output already. Either way, I get the same error as before.
Naturally, I expect to have 11 new variables named result.1_log, result.2_log, etc. inserted into the dataframe. Instead, I receive the above error and the new dataframe is not created.
I suspected that the rx functions do not like working with the quasi-quotation syntax, however, I have used it before when having to identify subjects with NA values of certain variables. This was done using the rowSelection argument of rxDataStep. I do realise that rowSelection requires a single, logical expression while transforms requires a named list of expressions.
Any help would be much appreciated since this type of data transformation will keep coming up in my analyses. I do suspect that I simply do not understand the inner workings of the quasi-quotation syntax, or perhaps how lists work in general, but hopefully there is a simple fix.
I am using Microsoft R Open 3.4.3.
My session info is the following:
R Services Information:
Local R: C:\Program Files\Microsoft\ML Server\R_SERVER\
Version: 1.3.40517.1016
Operating System: Microsoft Windows 10.0.17134
CPU Count: 4
Physical Memory: 12169 MB, 6810 MB free
Virtual Memory: 14025 MB, 7984 MB free
Video controller[1]: Intel(R) HD Graphics 620
GPU[1]: Intel(R) HD Graphics Family
Video memory[1]: 1024 MB
Connected users: 1
I'm not quite sure what you're trying to do as I think you've made things too complicated.
If all you want to do is take the log of every value in each column, then I show two approaches below.
Approach #1 is static, you know the fixed # of columns and hard code it. It's a bit faster for rxDataStep to run in this approach.
Approach #2 is a bit more dynamic, taking advantage of a transformFunc. transformFunc works in chunks, so it can be used safely in a clustered fashion. rxDataStep knows how to integrate the chunks together. But there will be a bit of a performance hit for it.
You might have been trying to find a hybrid approach - dynamically build the list for the transforms parameter in the rxDataStep. I haven't found a way to get that to work. Here's a similar question for doing it in rxSetVarInfo (Change a dynamic variable name with rxSetVarInfo) but using that approach hasn't yielded success for me yet.
Let me know if I've completely missed the mark!
library(foreach)
library(rlang)
startSize <- 10
endSize <- 20
vars <- foreach(i = startSize:endSize, .combine = "cbind") %do% rnorm(10, i)
data_in <- data.frame(vars)
tempInput <- tempfile(fileext = ".xdf")
tempOutput <- tempfile(fileext = ".xdf")
rxImport(inData = data_in, outFile = tempInput, overwrite = T)
rxGetInfo(tempInput, getVarInfo = T)
### Approach #1
print("Approach #1")
rxDataStep(inData = tempInput, outFile = tempOutput, overwrite = T,
transforms = list(
log_R1 = log10(result.1),
log_R2 = log10(result.2),
log_R3 = log10(result.3),
log_R4 = log10(result.4),
log_R5 = log10(result.5),
log_R6 = log10(result.6),
log_R7 = log10(result.7),
log_R8 = log10(result.8),
log_R9 = log10(result.9),
log_R10 = log10(result.10),
log_R11 = log10(result.11)))
rxGetInfo(tempOutput, getVarInfo = T)
### Approach #2
print("Approach #2")
logxform <- function(dataList) {
  # columnDepth is supplied through transformObjects
  for (j in seq_len(columnDepth)) {
    # log10() is vectorised, so each column of the chunk is transformed in one call
    dataList[[paste0("log_R", j)]] <- log10(dataList[[paste0("result.", j)]])
  }
  return(dataList)
}
rxDataStep(inData = tempInput, outFile = tempOutput, overwrite = T,
transformObjects = list(columnDepth = endSize - startSize + 1),
transformFunc = logxform)
rxGetInfo(tempOutput, getVarInfo = T)
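On the hybrid approach: the error text says transforms must be "of the form list(...)", which suggests (an assumption on my part, not verified against RevoScaleR) that the function looks at the unevaluated argument and expects a literal call to list(), not an existing list object. The base-R sketch below only shows how such a call can be built programmatically and evaluated against a data frame. Whether rxDataStep accepts it, for instance through do.call or by evaluating a constructed call, is untested. The column names and data are made up.

```r
vars <- c("result.1", "result.2", "result.3")

# One unevaluated log10(<column>) call per variable, named after the new column.
exprs <- lapply(vars, function(v) call("log10", as.name(v)))
names(exprs) <- paste0(vars, "_log")

# Wrap the expressions in a single call of the form list(name = expr, ...).
transforms_call <- as.call(c(quote(list), exprs))
print(transforms_call)

# Evaluating the call with a data frame as the environment yields the new columns.
data_in <- data.frame(result.1 = c(1, 10),
                      result.2 = c(100, 1000),
                      result.3 = c(10, 100))
new_cols <- eval(transforms_call, data_in)
data_out <- cbind(data_in, as.data.frame(new_cols))
print(data_out)
```

The difference from the original log_vars is that transforms_call is a language object whose function is list, while log_vars is an ordinary list of expressions.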
I am trying to create a loop where I select one file name from a list of file names, and use that one file to run read.capthist and subsequently discretize, fit, derived, and save the outputs using save. The list contains 10 files of identical rows and columns, the only difference between them are the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training, I've learned R on my own and used stack overflow a lot for solving my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
capt <- lst[i]
femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
save(fit, file="C:/temp/fit.Rdata")
D.fit <- derived(fit)
save(D.fit, file="C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like to have my outputs have unique identifiers as well, since I am simulating data and I will have to compare all the results, I don't want each iteration to overwrite the previous data output.
I know I can use this code by bringing in each file and running this separately (this code works for non-simulation runs of a couple data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
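A possible direction, sketched without secr so that it runs anywhere: iterate over the file names themselves, not over an empty list, and build each output name from the input name. This assumes read.capthist() wants a single file name as a character string, which is what the error message suggests. The file names, out_dir, and the placeholder fit object are all hypothetical.

```r
# Hypothetical capture file names; in practice something like
# files <- list.files(pattern = "^female")  # note: pattern is a regular expression
files <- c("female_sim01.txt", "female_sim02.txt", "female_sim03.txt")

out_dir <- file.path(tempdir(), "fits")
dir.create(out_dir, showWarnings = FALSE)

for (i in seq_along(files)) {
  capt <- files[i]  # a single character string, not a one-element list
  id <- tools::file_path_sans_ext(basename(capt))

  # Placeholder for read.capthist(), discretize(), secr.fit() and derived().
  fit <- list(source = capt)

  # One output file per input file, so no iteration overwrites another.
  save(fit, file = file.path(out_dir, paste0("fit_", id, ".RData")))
}

saved <- list.files(out_dir)
print(saved)
```

In the original loop, lst is created empty with vector("list", length(files)), so lst[i] is a one-element list containing NULL, not a file name. Indexing files directly avoids that.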