Lock file when writing to it from parallel processes in R

I use parSapply() from the parallel package in R. I need to perform calculations on a huge amount of data. Even in parallel it takes hours to execute, so I decided to write results to a file regularly from the cluster workers using write.table(), because the process crashes from time to time when running out of memory or for some other random reason, and I want to continue the calculations from the place where they stopped. I noticed that some lines of the csv files I get are simply cut off in the middle, probably as a result of several processes writing to the file at the same time. Is there a way to place a lock on the file for the time while write.table() executes, so other workers can't access it, or is the only way out to write to a separate file from each worker and then merge the results?

It is now possible to create file locks using filelock (GitHub).
In order to make this work with parSapply() you would need to edit your loop so that if the file is locked the process does not simply quit, but either tries again or Sys.sleep()s for a short amount of time. However, I am not certain how this will affect your performance.
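A minimal sketch of that idea (assuming the filelock package and that all workers see the same filesystem; the file names and the safe_append() helper are illustrative, not part of the original answer):

# Minimal sketch: serialize appends through a lock file so only one worker
# writes to results.csv at a time. Names here are illustrative.
library(filelock)

safe_append <- function(rows, file = "results.csv") {
  lck <- lock(paste0(file, ".lock"), timeout = Inf)  # wait until the lock is free
  on.exit(unlock(lck), add = TRUE)                   # release even if the write fails
  write.table(rows, file = file, append = TRUE, sep = ",",
              row.names = FALSE, col.names = !file.exists(file))
}

Each worker would call safe_append() from inside the function passed to parSapply(); with timeout = Inf the call blocks until the lock is available instead of failing.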
Instead, I recommend you create cluster-specific files to hold your data, eliminating the need for a lock file without hurting your performance. Afterwards you should be able to weave these files together into your final results file.
If size is an issue then you can use disk.frame to work with files that are larger than your system RAM.

The old unix technique looks like this:
# Make sure other processes are not writing to the file by trying to create a
# lock directory: mkdir fails if the directory already exists, so keep trying,
# and exit the repeat loop once the lock directory has been created successfully.
repeat {
  if (system2(command = "mkdir", args = "lockdir", stderr = NULL) == 0) break
}
write.table(MyTable, file = filename, append = TRUE)
# Get rid of the locking directory
system2(command = "rmdir", args = "lockdir")
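A portable sketch of the same directory-lock idea in base R only (my addition, assuming all workers share the working directory): dir.create() returns FALSE rather than an error when the directory already exists, so it can serve as the atomic "take the lock" step.

# Sketch: base-R variant of the directory lock; Sys.sleep() avoids busy-waiting.
repeat {
  if (dir.create("lockdir", showWarnings = FALSE)) break  # TRUE only for the process that created it
  Sys.sleep(0.1)
}
write.table(MyTable, file = filename, append = TRUE)
unlink("lockdir", recursive = TRUE)  # release the lock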

Related

Are locks necessary for writing with fwrite from parallel processes in R?

I have an intensive simulation task that is run in parallel on a high-performance cluster.
Each thread (~3000 of them) uses an R script to write the simulation output with the fwrite function of the data.table package.
Our IT guy told me to use locks, so I use the flock package to lock the file while all threads are writing to it.
But this created a new bottleneck: most of the time the processes wait until they can write. Now I am wondering how I can evaluate whether the lock is really necessary. To me it just seems very strange that more than 90% of the processing time for all jobs is spent waiting for the lock.
Can anyone tell me if it really is necessary to use locks when I only append results to a csv with the fwrite function and the argument append = T?
Edit:
I already tried writing individual files and merging them in various ways after all jobs were completed. But merging also took too long to be acceptable.
It still seems to be the best way to just write all simulation results to one file without lock. This works very fast and I did not find errors when doing it without the lock for a smaller number of simulations.
Could writing without lock cause some problems that will be unnoticed after running millions of simulations?
(I started writing a few comments to this effect, then decided to wrap them up in an answer. This isn't a perfect step-by-step solution, but your situation is not so simple, and quick-fixes are likely to have unintended side-effects in the long-term.)
I completely agree that relying on file-locking is not a good path. Even if the shared filesystem[1] supports them "fully" (many claim it but with caveats and/or corner-cases), they almost always have some form of performance penalty. Since the only time you need the data all together is at data harvesting (not mid-processing), the simplest approach in my mind is to write to individual files.
When the whole processing is complete, either (a) combine all files into one (simple bash scripts) and bulk-insert into a database; (b) combine into several big files (again, bash scripts) that are small enough to be read into R; or (c) file-by-file insert into the database.
Combine all files into one large file. Using bash, this might be as simple as
find mypath -name out.csv -print0 | xargs -0 cat > onebigfile.csv
Where mypath is the directory under which all of your files are contained, and each process is creating its own out.csv file within a unique sub-directory. This is not a perfect assumption, but the premise is that if each process creates a file, you should be able to uniquely identify those output files from all other files/directories under the path. From there, the find ... -print0 | xargs -0 cat > onebigfile.csv is I believe the best way to combine them all.
From here, I think you have three options:
Insert into a server-based database (postgresql, sql server, mariadb, etc) using the best bulk-insert tool available for that DBMS. This is a whole new discussion (outside the scope of this Q/A), but it can be done "formally" (with a working company database) or "less-formally" using a docker-based database for your project use. Again, docker-based databases can be an interesting and lengthy discussion.
Insert into a file-based database (sqlite, duckdb). Both of those options claim to support file sizes well over what you would require for this data, and they both give you the option of querying subsets of the data as needed from R. If you don't know the DBI package or the DBI way of doing things, I strongly suggest starting at https://dbi.r-dbi.org/ and https://db.rstudio.com/. (A duckdb sketch follows after this list of options.)
Splitting the file and then read piece-wise into R. I don't know if you can fit the entire data into R, but if you can and the act of reading them in is the hurdle, then
split --lines=1000000 onebigfile.csv smallerfiles.csv.
HDR=$(head -n 1 onebigfile.csv)
sed -i -e "1i ${HDR}" smallerfiles.csv.*
sed -i -e "1d" smallerfiles.csv.aa
where 1000000 is the number of rows you want in each smaller file. You will find n files named smallerfiles.csv.aa, *.ab, *.ac, etc. (depending on the size, perhaps you'll see three or more letters).
The HDR= line and the first sed prepend the header row to all smaller files; since the first smaller file already has it, the second sed removes the duplicate first row.
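To make the file-based-database option concrete, here is a hedged sketch (table, file, and column names are illustrative, and it assumes the DBI and duckdb packages) that bulk-loads the combined csv into a DuckDB file and queries a subset back into R:

# Hedged sketch: load onebigfile.csv into a file-based DuckDB database, then
# query only what you need back into R. All names are illustrative.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), dbdir = "simulations.duckdb")

# DuckDB can read the csv on the SQL side, avoiding a round-trip through R.
dbExecute(con, "CREATE TABLE results AS SELECT * FROM read_csv_auto('onebigfile.csv')")

# Pull back only a subset when needed.
res <- dbGetQuery(con, "SELECT * FROM results WHERE scenario = 1")

dbDisconnect(con, shutdown = TRUE)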
Read each file individually into R or into the database. To bring into R, this would be done with something like:
files <- list.files("mypath", pattern = "^out\\.csv$", recursive = TRUE, full.names = TRUE)
library(data.table)
alldata <- rbindlist(lapply(files, fread))
assuming that R can hold all of the data at one time. If R cannot (either doing it this way or just reading onebigfile.csv above), then you really have no other option than some form of database[2].
To read them individually into the DBMS, you could likely do it in bash (well, any shell, just not R) and it would be faster than R. For that matter, though, you might as well combine into onebigfile.csv and do the command-line insert once. One advantage, however, of inserting individual files into the database is that, given a reasonably-simple bash script, you could read the data in from completed threads while other threads are still working; this provides mid-processing status cues and, if the run-time is quite long, might give you the ability to do some work before the processing is complete.
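The answer suggests bash for those per-file inserts; if you would rather stay in R, a hedged sketch of harvesting finished files into the database (again assuming DBI plus duckdb, with illustrative names) might look like this:

# Hedged sketch: append out.csv files from finished threads to the database,
# even while other threads are still running. It assumes a thread's out.csv
# only appears once it has been completely written.
library(DBI)
library(data.table)

con <- dbConnect(duckdb::duckdb(), dbdir = "simulations.duckdb")

files <- list.files("mypath", pattern = "^out\\.csv$",
                    recursive = TRUE, full.names = TRUE)
for (f in files) {
  dbWriteTable(con, "results", fread(f),
               append = dbExistsTable(con, "results"))  # create on first file, append afterwards
}
dbDisconnect(con, shutdown = TRUE)

If you run this repeatedly during the run, you would also need to track (or move aside) the files already loaded so they are not inserted twice.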
Notes:
"Shared filesystem": I'm assuming that these are not operating on a local-only filesystem. While certainly not impossible, most enterprise high-performance systems I've dealt with are based on some form of shared filesystem, whether it be NFS or GPFS or similar.
"Form of database": technically, there are on-disk file formats that support partial reads in R. While vroom:: can allegedly do memory-mapped partial reads, I suspect you might run into problems later as it may eventually try to read more than memory will support. Perhaps disk.frame could work, I have no idea. Other formats such as parquet or similar might be usable, I'm not entirely sure (nor do I have experience with them to say more than this).

How to get data into h2o fast

What my question isn't:
Efficient way to maintain a h2o data frame
H2O running slower than data.table R
Loading data bigger than the memory size in h2o
Hardware/Space:
32 Xeon threads w/ ~256 GB RAM
~65 GB of data to upload (about 5.6 billion cells)
Problem:
It is taking hours to upload my data into h2o. This isn't any special processing, only "as.h2o(...)".
It takes less than a minute using "fread" to get the text into R, and then I make a few row/col transformations (diffs, lags) and try to import.
The total R memory is ~56 GB before trying any sort of "as.h2o", so the 128 GB allocated shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
bumping RAM up to 128 GB in 'h2o.init'
using slam, data.table, and options( ...
convert to "as.data.frame" before "as.h2o"
write to csv file (R's write.csv chokes and takes forever; it is writing a lot of GB though, so I understand).
write to sqlite3: too many columns for a table, which is weird.
Checked drive cache/swap to make sure there are enough GB there; perhaps Java is using cache (still working).
Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" for it. I'm up to 15GB written.
Update2:
It is a hideous csv file, at ~22 GB (~2.4M rows, ~2300 cols). For what it's worth, it took from 12:53 PM until 2:44 PM to write the csv file. Importing it was substantially faster, once it was written.
Think of as.h2o() as a convenience function that does these steps:
converts your R data to a data.frame, if not already one
saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
calls h2o.uploadFile() on that temp file
deletes the temp file
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The deciding factor between the two is visibility:
With h2o.uploadFile() your client has to be able to see the file.
With h2o.importFile() your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
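A rough sketch of that faster path (file path and object names are illustrative; it assumes the file lives where the H2O cluster can see it, and that dt is the already-prepared data.table):

# Hedged sketch: write once with the multi-threaded fwrite(), then let the H2O
# cluster import the file in parallel instead of uploading it through the client.
library(data.table)
library(h2o)

h2o.init(max_mem_size = "128g")

fwrite(dt, "/shared/path/bigdata.csv")              # dt = your prepared data.table
hf <- h2o.importFile("/shared/path/bigdata.csv")    # multi-threaded, cluster-side import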
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
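A hedged sketch of that column-splitting tip (file names and column counts are illustrative, and it assumes both files keep their rows in the same order):

# Hedged sketch: process only the ~100 needed columns in R, keep the other
# ~2200 in a separate csv, and h2o.cbind() the two frames after importing each.
need <- fread("/shared/path/need_cols.csv")    # the columns you transform in R
# ... row/col transformations on `need` ...
fwrite(need, "/shared/path/need_cols_transformed.csv")

hf <- h2o.cbind(h2o.importFile("/shared/path/need_cols_transformed.csv"),
                h2o.importFile("/shared/path/rest_cols.csv"))  # untouched columns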
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.

A more efficient .RData?

I am working with large data sets and often switch between my work station and laptop. Saving a workspace image to .RData is for me the most natural and convenient way, so this is the file that I want to synchronize between the two computers.
Unfortunately, it tends to be rather big (a few GB), so efficient synchronisation requires either connecting my laptop with a cable or moving the files with a USB stick. If I forget to synchronize my laptop while I am next to my workstation, it takes hours to make sure everything is synchronized.
The largest objects, however, change relatively rarely (although I constantly work with them). I could save them to another file, and then delete them before saving the session and load them after restoring the session. This would work, but would be extremely annoying. Also, I would have to remember to save them whenever they are modified. It would soon end up being a total mess.
Is there a more efficient way of dealing with such large data chunks?
For example, my problem would be solved if there was an alternative format to .RData -- one in which .RData is a directory, and files in that directory are objects to be loaded.
You can use saveRDS:
objs.names <- ls()
objs <- mget(objs.names)
invisible(
  lapply(
    seq_along(objs),
    function(x) saveRDS(objs[[x]], paste0("mydatafolder/", objs.names[[x]], ".rds"))
  )
)
This will save every object in your session to the "mydatafolder" folder as a separate file (make sure to create the folder beforehand).
Unfortunately, this rewrites every file and therefore updates all of the timestamps, so you can't rely on rsync alone. You could first read the saved objects back in with readRDS, see which ones have changed with identical, and only run the lapply above on the changed objects; then you can use something like rsync.
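A hedged sketch of that changed-objects-only idea (same illustrative folder name; it assumes the objects fit in memory so the on-disk copies can be compared with identical(), which has a local read cost but saves network transfer):

# Hedged sketch: only re-save objects whose contents differ from the copy on
# disk, so rsync only transfers the .rds files that actually changed.
objs.names <- ls()
for (nm in objs.names) {
  path <- file.path("mydatafolder", paste0(nm, ".rds"))
  obj  <- get(nm)
  if (!file.exists(path) || !identical(obj, readRDS(path))) {
    saveRDS(obj, path)
  }
}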

Running jobs in background in R

I am working with a 250 by 250 matrix. However, it takes loads and loads of time to compute this. It takes like an hour at least.
Is it possible to store this matrix in memory in R, such that every time I open up R it is already there?
Ideally, I would like to know if it is possible to run a job in the background in R, so that I don't have to wait an hour to get the matrix out and be able to play around with it.
1) You can save the workspace of R when closing R. Usually R asks "Save workspace image?" when you are closing it. If you answer "Yes" it will save the workspace in a file named ".RData" and will load it when starting a new R instance.
2) The better (safer) option is to save the matrix explicitly. There are several ways it can be done. One of the options is to save it as an Rdata file:
save(m, file = "matrix.Rdata")
where m is your matrix.
You can load the matrix at any time with
load("matrix.Rdata")
if you are on the same working directory.
3) There is no such option as background computing within an R session, but you can open several R instances: do the computation in one instance, and do something else in the other.
What would help is to output it to a file once you have computed it and then parse that file every time you open R. Write yourself a computeMatrix() function or script to produce a file with the matrix stored in a sensible format. Also write yourself a loadMatrix() function or script to read in that file and load the matrix into memory for use, then call or run loadMatrix every time you start R and want to use the matrix.
In terms of running an R job in the background, you can run an R script from the command line with the syntax "R CMD BATCH scriptName" with scriptName replaced by the name of your script.
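A minimal sketch of that pattern (file names and the computation itself are placeholders): the script saves the finished matrix with saveRDS(), and launching it with R CMD BATCH runs it outside your interactive session.

# computeMatrix.R -- hedged sketch; launch from a shell with:
#   R CMD BATCH computeMatrix.R
# so the hour-long computation runs outside your interactive R session.
m <- some_long_computation()   # placeholder for building the 250 x 250 matrix
saveRDS(m, "matrix.rds")

# loadMatrix step, later, in your interactive session:
# m <- readRDS("matrix.rds")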
It might be better to use the ff package and save the matrix as an ff object. This means that the actual matrix will be saved on the disk in an efficient manner, then when you start a new R session you can point to that same file without loading the entire matrix into memory. When you need part of the matrix, only the part you need will be loaded so it will be much quicker. Even if you need the entire matrix loaded into memory it should load faster than reading a text file.

Copy files while preserving original file information (creation time etc.)

In order to ease the manual copying of large numbers of files, I often use FreeFileSync. I noticed that it preserves the original file information, such as when a file was created, last modified etc.
Now I need to regularly copy tons of files in batch mode and I'd like to do it in R. So I wondered if R is capable of preserving that information as well. AFAIU, file.rename() and file.copy() alter the file information, e.g. the times are set to the time the files were actually copied.
Is there any way I can restore the original file information after the files have been copied?
Robocopy via system2() can keep the timestamps.
cmdArgs <- paste(normalizePath(file.path(getwd()), winslash = "/"),
                 normalizePath(file.path(getwd(), "bkup"), winslash = "/"),
                 "*.txt",
                 "/copy:DAT /V")
system2("robocopy.exe", args = cmdArgs)
Robocopy has a slew of switches for all different types of use cases and can accept a 'job' file for the params and file names. The ability of R to call out using system could also be used to execute an elevated session (perhaps the easiest would be by using a powershell script to call Robocopy) so that all of the auditing info (permissions and such) could be retained as well.
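As a cross-platform aside (my addition, not part of the robocopy answer, and it only covers permissions and modification times, not Windows creation times): base R's file.copy() has a copy.date argument, and Sys.setFileTime() can restore a modification time after the fact. A minimal sketch with illustrative paths:

# Hedged sketch, base R only: keeps the mode and modification time of the copy.
file.copy(from = "data/big.txt", to = "bkup/big.txt",
          copy.mode = TRUE, copy.date = TRUE)

# Or restore the modification time after an ordinary copy:
Sys.setFileTime("bkup/big.txt", file.info("data/big.txt")$mtime)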
