Using a for loop to write in multiple .grd files - r

I am working with very large data layers for an SDM class, and because of this I ended up breaking some of my layers into a bunch of blocks to avoid memory constraints. These blocks were written out as .grd files, and now I need to read them back into R and merge them together. I am extremely new to R and to programming in general, so any help would be appreciated. What I have been trying so far looks like this:
library(raster)

merge.coarse <- raster("coarseBlock1.grd")
for (i in 2:nBlocks) {
  merge.coarse <- merge(merge.coarse, raster(paste("coarseBlock", i, ".grd", sep = "")))
}
where my files are named coarseBlock1.grd through coarseBlockN.grd and are sequentially numbered from 1 to nBlocks (259).
Any feedback would be greatly appreciated.

Using for loops is generally slow in R. Also, growing an object by calling functions like merge or rbind inside a for loop eats up a lot of memory because of the way R passes values to these functions.
A more efficient way to do this task is to call lapply (see this tutorial on apply functions for details) to load the files into R. This results in a list of rasters, which can then be collapsed into a single raster by passing it to merge via do.call:
rasters <- lapply(list.files(GRDFolder, pattern = "\\.grd$", full.names = TRUE), FUN = raster)
merge.coarse <- do.call(merge, rasters)

I'm not too familiar with .grd files, but this overall process should at least get you going in the right direction. Assuming all your .grd files (1 through 259) are stored in the same folder (which I will refer to as GRDFolder), then you can try this:
merge.coarse <- raster("coarseBlock1.grd")
for (filename in list.files(GRDFolder, pattern = "\\.grd$", full.names = TRUE))
{
  temp <- raster(filename)
  merge.coarse <- merge(merge.coarse, temp)
}

Related

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
library(foreach)

grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")

cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)

m <- foreach(col = 1:ncol(grid)) %:% foreach(row = 1:nrow(grid)) %dopar% {
  # get extent of subset
  cell <- raster::cellFromRowCol(grid, row, col)
  ext <- raster::extentFromCells(grid, cell)
  # crop main raster to subset extent
  subset <- raster::crop(data, ext)
  # ...
  # perform some processing steps on the raster subset
  # ...
  # save results to a separate file
  saveRDS(subset, paste0("output_folder/", row, "_", col))
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file every time it is called. This seems to be standard behavior of the raster package, but it becomes a problem because these temp files are only deleted after the whole code has been executed, and in the meantime they take up far too much disk space (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset@file@name). However, this no longer works when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this: removeTmpFiles().
You should be able to use f <- filename(subset) and avoid reading from slots (@). I do not see why you would not be able to remove it, but perhaps it needs some fiddling with the path?
Temp files are only created when the raster package deems it necessary, based on the RAM available and required. See canProcessInMemory( , verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory).
Another approach is to provide a filename argument to crop. Then you know what the filename is and you can delete it. Of course you need to take care not to overwrite data from different tasks, so you may need to use some unique id associated with each one (see the sketch below).
saveRDS() won't work if the raster is backed by a temp file (as it will disappear).
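Putting the filename suggestion together with the loop from the question, a minimal sketch might look like the following. It reuses grid, data and the cluster set up above; crop_folder, the native .grd/.gri format and the row/col naming are illustrative assumptions, not part of the original code:
m <- foreach(col = 1:ncol(grid)) %:% foreach(row = 1:nrow(grid)) %dopar% {
  cell <- raster::cellFromRowCol(grid, row, col)
  ext <- raster::extentFromCells(grid, cell)

  # crop to a file whose name we control, instead of letting raster pick a temp file;
  # row/col give each task its own file
  crop_file <- file.path("crop_folder", paste0(row, "_", col, ".grd"))
  subset <- raster::crop(data, ext, filename = crop_file, overwrite = TRUE)

  # ... processing steps ...

  # readAll() pulls the cell values into memory, so the saved object
  # no longer depends on the file we are about to delete
  saveRDS(raster::readAll(subset), paste0("output_folder/", row, "_", col))

  # the native .grd format writes a .grd/.gri pair; remove both
  file.remove(crop_file, sub("\\.grd$", ".gri", crop_file))
}
Alternatively, raster::removeTmpFiles(h = 0) can be called inside the worker to clear temp files older than 0 hours, though it is worth checking that it does not touch files another task still needs.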

How to output multiple objects with different names using for loop in R

I'm attempting to write an R script in a way that remains as automated as possible. To this end, I am trying to create a for loop to execute a function on multiple files. The outputs need to be saved as objects for the purposes of the program I am using and therefore each output from the for loop needs to have a distinct name. This is the code I have so far:
filenames <- as.list(Sys.glob("*.ab1"))
SeqOb <- list()
for (i in filenames)
{
SeqOb <- readsangerseq(i)
}
"readsangerseq" is the function I'm attempting to execute to create multiple SeqOb objects. What I've read from other discussions led me to create an empty list in which to store my output objects, but I can't seem to figure out how to make the for loop write them as distinct outputs.
If you would like to continue using the for loop and want distinct outputs instead of a list, you may consider using assign(paste()) in order to give each file a unique object name. Although, as a relative newcomer to R myself, I'm starting to learn there are more elegant ways than for loops as well, such as MrFlick's answer (a list-based sketch follows the loop below).
for (i in seq_along(filenames)) {
  # You may be able to substitute your function in the line below
  assign(paste("SomeNamingRule", i, sep = ""), readsangerseq(filenames[[i]]))
}
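For reference, a minimal sketch of the list-based approach mentioned above, assuming readsangerseq() takes a file path as in the question (naming the elements after the files is just one convention):
filenames <- Sys.glob("*.ab1")

# one readsangerseq() call per file; the result is a list with one element per file
SeqOb <- lapply(filenames, readsangerseq)

# label each element with the file it came from, so SeqOb[["sample.ab1"]] works
names(SeqOb) <- basename(filenames)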

Splitting a csv file into multiple txt. files

I have a large csv dataset that I want to split into multiple txt files. I want the name of each file to come from the ID column and the content of each file to come from the Text column. My data looks something like this.
ID Text
1 I like dogs
2 My name is
3 It is sunny
Would anyone be able to help advise? I don't mind using excel or R.
Thank you!
In R, you can split the data by ID and use writeLines to write each piece to a text file.
If your dataframe is called df, try:
temp <- split(df$Text, df$ID)
Map(function(x, y) writeLines(x, paste0(y, '.txt')), temp, names(temp))
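For example, with the sample data from the question (building the data frame here is only for illustration):
df <- data.frame(ID = c(1, 2, 3),
                 Text = c("I like dogs", "My name is", "It is sunny"),
                 stringsAsFactors = FALSE)

temp <- split(df$Text, df$ID)
Map(function(x, y) writeLines(x, paste0(y, '.txt')), temp, names(temp))
# writes 1.txt, 2.txt and 3.txt to the working directory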
If you have a lot of rows, this is a good task for parallel computing. (Here's the general premise: R spends a lot of time formatting the file. Writing to the disk can't be done in parallel, but formatting the file can.) So let's do it in parallel!
The furrr package is one of my favorites: In short, it adds parallel processing capabilities to the purrr package, whose map functions are quite useful. In this case, we want to use the future_pmap function, which lets us apply a function to each row of a dataframe. This should be all the code you need:
library(furrr)
plan(multisession)  # plan(multiprocess) is deprecated in recent versions of the future package

# the function arguments must match the column names of df (here ID and Text)
future_pmap(df, function(ID, Text) write(Text, paste0(ID, ".txt")))
I tested the parallel and normal versions of this function on a dataframe with 31,496 rows, and the parallel version took only 60 percent as long. This method is also about 20 percent faster than Ronak Shah's writeLines method.

R script, programmatically batch import multiple csv files as list of data frames (solution)

I'm relatively new to R but experienced in traditional programming languages (e.g., C, Java). I've recently run into the situation where I had so many data files to load that I was spending almost as much time on that one task as I was on the actual analysis. I spent a little time googling this but didn't run across any solutions that I found directly relevant (I might have missed something, I'm impatient that way). Despite that I came up with a simple solution to my problem that I wanted to share with the community in case anyone else found themselves in similar circumstances.
A bit of background info: The data I'm analyzing is real-time performance and diagnostic metrics for an experimental system that is driven by real-time data feeds (i.e., complicated). The upshot is that between trials filenames don't change and the data is written out directly to csv files (I wrote the logging code so I get to be my own best friend like that ;). There are dozens of files generated during a single trial and we have potentially hundreds of trials to look forward to.
I had a few ideas and after playing around with the code a bit I came up with the following solution:
# Create mapping that associates files with a handle that the loader will use to
# generate a named list of data frames (don't even try this on the cmdline)
createDataFileMapping <- function() {
  list(
    c(file = "file1.csv", descr = "descriptor1"),
    c(file = "file2.csv", descr = "descriptor2"),
    ...
  )
}
# Batch load csv files and return as list of data frames
loadTrialData <- function(load.dir, mapping) {
  dfList <- list()
  for (item in mapping) {
    file <- paste(load.dir, item[["file"]], sep = "/")
    df <- read.csv(file)
    dfList[[ item[["descr"]] ]] <- df
  }
  return(dfList)
}
Invoking is as simple as loadTrialData("~/data/directory", createDataFileMapping()).
I'm sure there are other ways to solve this problem but the above gets the job done in my case. I'm sure this is slightly less memory-efficient than loading the files directly into data frames in the global environment, and the syntax for passing individual data frames to analysis/plotting functions isn't as elegant as it could be, but I'm not choosy. If you have a more flexible/generalizable solution then please don't hesitate to post!
What you have is sound; I would add only two comments:
Don't worry about extra memory usage: assuming the data frames are of nontrivial size, you won't lose much by putting them in a big list.
You might add ... as an argument to your function and pass it through to read.csv, so that if another user needs to specify extra arguments because their file wasn't in quite the same format (or wants stringsAsFactors=FALSE, or something), they have the flexibility to do that (see the sketch below).
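A minimal sketch of that second suggestion, applied to the loadTrialData function above:
# Batch load csv files and return as a named list of data frames;
# extra arguments (e.g. stringsAsFactors = FALSE, sep = ";") are passed on to read.csv
loadTrialData <- function(load.dir, mapping, ...) {
  dfList <- list()
  for (item in mapping) {
    file <- paste(load.dir, item[["file"]], sep = "/")
    dfList[[ item[["descr"]] ]] <- read.csv(file, ...)
  }
  return(dfList)
}

# e.g. loadTrialData("~/data/directory", createDataFileMapping(), stringsAsFactors = FALSE)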

Very long list of ".asc" files in R and apply do.call. How to deal with it?

Hope you can help me with this. I am working on a piece of code that puts together a very long list of .asc files (each file contains 307200 (640*480) pixels with temperature information from a thermal IR camera). I developed the code using only 5 files; however, when I try to apply it to the whole file list (roughly 4000 .asc files), R gets stuck. I have been told I should modify my code to use batch processing or otherwise optimize it, but I am not an expert in this field and I need some help.
I have included here the first part of the code, which lists all the .asc files and puts them together in a single data frame.
temp <- list.files(pattern = "*.asc")
myfiles <- do.call("cbind", lapply(temp, function(x) read.csv(x,
  sep = "\t", dec = ",", stringsAsFactors = FALSE, header = FALSE)))

mf <- myfiles[c(-1:-8), ]
colnames(mf) <- seq(1, ncol(mf), by = 1)
rownames(mf) <- seq(1, nrow(mf), by = 1)

for (i in 1:ncol(mf)) {
  mf[, i] <- sub(",", ".", mf[, i])
  mf[, i] <- sub("\t", "", mf[, i])
}
for (i in 1:ncol(mf)) {
  mf[, i] <- as.numeric(mf[, i])
}
Thanks in advance for your help! :)
Amaia
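One direction worth sketching, based only on the code above: read.csv is already told that the decimal separator is a comma (dec = ","), so if the eight metadata rows are skipped at read time instead of being deleted afterwards, the columns come in as numeric and the sub()/as.numeric() loops over every column disappear, which saves both time and a large character copy of the data. A rough, untested sketch (skip = 8 stands in for the mf <- myfiles[c(-1:-8), ] row removal and is an assumption):
files <- list.files(pattern = "\\.asc$")

# read one file: skip the 8 header lines and let dec = "," produce numeric columns directly
read_one <- function(f) {
  read.table(f, sep = "\t", dec = ",", skip = 8, header = FALSE,
             colClasses = "numeric")
}

mf <- do.call(cbind, lapply(files, read_one))
colnames(mf) <- seq_len(ncol(mf))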
