I am working with large data sets and often switch between my workstation and laptop. Saving a workspace image to .RData is for me the most natural and convenient way to do this, so that is the file I want to synchronize between the two computers.
Unfortunately, it tends to be rather big (a few GB), so efficient synchronization requires me either to connect the laptop to the workstation with a cable or to move the files on a USB stick. If I forget to synchronize the laptop while I am next to my workstation, it takes hours to get everything back in sync.
The largest objects, however, change relatively rarely (although I constantly work with them). I could save them to another file, and then delete them before saving the session and load them after restoring the session. This would work, but would be extremely annoying. Also, I would have to remember to save them whenever they are modified. It would soon end up being a total mess.
Is there a more efficient way of dealing with such large data chunks?
For example, my problem would be solved if there was an alternative format to .RData -- one in which .RData is a directory, and files in that directory are objects to be loaded.
You can use saveRDS:
# collect the names of all objects in the session and the objects themselves
objs.names <- ls()
objs <- mget(objs.names)

# save each object to its own .rds file in "mydatafolder"
invisible(
  lapply(
    seq_along(objs),
    function(x) saveRDS(objs[[x]], paste0("mydatafolder/", objs.names[[x]], ".rds"))
  )
)
This will save every object in your session to the "mydatafolder" folder as a separate file (make sure to create the folder beforehand).
Unfortunately, this rewrites every file and therefore updates all of their timestamps, so rsync alone won't help. You could instead first read the existing objects back in with readRDS, check which ones have changed with identical, and only run the lapply above on the changed objects; a tool like rsync then only needs to transfer those files.
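A minimal sketch of that incremental idea, reusing the same objs / objs.names setup and the "mydatafolder" layout from above (nothing here beyond base R):
# compare each in-memory object against the copy already on disk
changed <- vapply(objs.names, function(nm) {
  path <- paste0("mydatafolder/", nm, ".rds")
  # re-save if the file does not exist yet or the stored copy differs
  !file.exists(path) || !identical(objs[[nm]], readRDS(path))
}, logical(1))

# write only the objects that actually changed, so rsync has less to transfer
invisible(lapply(objs.names[changed], function(nm) {
  saveRDS(objs[[nm]], paste0("mydatafolder/", nm, ".rds"))
}))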
Related
I have an intensive simulation task that is run in parallel on a high-performance cluster.
Each of the ~3000 threads uses an R script to write its simulation output with the fwrite function of the data.table package.
Our IT guy told me to use locks, so I use the flock package to lock the file while all threads are writing to it.
But this created a new bottleneck: most of the time the processes just wait until they can write. Now I am wondering how I can evaluate whether the lock is really necessary. To me it seems very weird that more than 90% of the processing time for all jobs is spent waiting for the lock.
Can anyone tell me if it really is necessary to use locks when I only append results to a csv with the fwrite function and the argument append = T?
Edit:
I already tried writing individual files and merging them in various ways after all jobs were completed, but merging also took too long to be acceptable.
It still seems best to simply write all simulation results to one file without a lock. This is very fast, and I did not find errors when doing it without the lock for a smaller number of simulations.
Could writing without lock cause some problems that will be unnoticed after running millions of simulations?
(I started writing a few comments to this effect, then decided to wrap them up in an answer. This isn't a perfect step-by-step solution, but your situation is not so simple, and quick-fixes are likely to have unintended side-effects in the long-term.)
I completely agree that relying on file-locking is not a good path. Even if the shared filesystem[1] supports them "fully" (many claim it but with caveats and/or corner-cases), they almost always have some form of performance penalty. Since the only time you need the data all together is at data harvesting (not mid-processing), the simplest approach in my mind is to write to individual files.
When the whole processing is complete, either (a) combine all files into one (simple bash scripts) and bulk-insert into a database; (b) combine into several big files (again, bash scripts) that are small enough to be read into R; or (c) file-by-file insert into the database.
Combine all files into one large file. Using bash, this might be as simple as
find mypath -name out.csv -print0 | xargs -0 cat > onebigfile.csv
Where mypath is the directory under which all of your files are contained, and each process is creating its own out.csv file within a unique sub-directory. This is not a perfect assumption, but the premise is that if each process creates a file, you should be able to uniquely identify those output files from all other files/directories under the path. From there, the find ... -print0 | xargs -0 cat > onebigfile.csv is I believe the best way to combine them all.
From here, I think you have three options:
Insert into a server-based database (postgresql, sql server, mariadb, etc) using the best bulk-insert tool available for that DBMS. This is a whole new discussion (outside the scope of this Q/A), but it can be done "formally" (with a working company database) or "less-formally" using a docker-based database for your project use. Again, docker-based databases can be an interesting and lengthy discussion.
Insert into a file-based database (sqlite, duckdb). Both of those options allege supporting file sizes well over what you would require for this data, and they both give you the option of querying subsets of the data as needed from R. If you don't know the DBI package or DBI way of doing things, I strongly suggest starting at https://dbi.r-dbi.org/ and https://db.rstudio.com/.
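For this file-based route, a minimal sketch using DBI with duckdb could look like the following (this assumes the onebigfile.csv produced above and that the DBI and duckdb packages are installed; the table and database names are just placeholders):
library(DBI)

# open (or create) a DuckDB database file for the project
con <- dbConnect(duckdb::duckdb(), dbdir = "simulations.duckdb")

# bulk-load the combined csv into a table; read_csv_auto is DuckDB's
# built-in CSV reader and guesses the column types
dbExecute(con, "CREATE TABLE results AS SELECT * FROM read_csv_auto('onebigfile.csv')")

# query subsets from R as needed instead of loading everything into memory
peek <- dbGetQuery(con, "SELECT * FROM results LIMIT 10")

dbDisconnect(con, shutdown = TRUE)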
Split the file and then read it piece-wise into R. I don't know if you can fit the entire data into R, but if you can and the act of reading it in is the hurdle, then
split --lines=1000000 onebigfile.csv smallerfiles.csv.
HDR=$(head -n 1 onebigfile.csv)
sed -i -e "1i ${HDR}" smallerfiles.csv.*
sed -i -e "1d" smallerfiles.csv.aa
where 1000000 is the number of rows you want in each smaller file. You will find n files named smallerfiles.csv.aa, *.ab, *.ac, etc. (depending on the size, perhaps you'll see three or more letters).
The HDR= line and the first sed prepend the header row to all smaller files; since the first smaller file already has it, the second sed removes the duplicate first row.
Read each file individually into R or into the database. To bring into R, this would be done with something like:
files <- list.files("mypath", pattern = "^out.csv$", recursive = TRUE, full.names = TRUE)
library(data.table)
alldata <- rbindlist(lapply(files, fread))
assuming that R can hold all of the data at one time. If R cannot (either doing it this way or just reading onebigfile.csv above), then you really have no other options than a form of database[2].
To read them individually into the DBMS, you could likely do it in bash (well, any shell, just not R) and it would be faster than R. For that matter, though, you might as well combine into onebigfile.csv and do the command-line insert once. One advantage, however, of inserting individual files into the database is that, given a reasonably-simple bash script, you could read the data in from completed threads while other threads are still working; this provides mid-processing status cues and, if the run-time is quite long, might give you the ability to do some work before the processing is complete.
Notes:
"Shared filesystem": I'm assuming that these are not operating on a local-only filesystem. While certainly not impossible, most enterprise high-performance systems I've dealt with are based on some form of shared filesystem, whether it be NFS or GPFS or similar.
"Form of database": technically, there are on-disk file formats that support partial reads in R. While vroom:: can allegedly do memory-mapped partial reads, I suspect you might run into problems later as it may eventually try to read more than memory will support. Perhaps disk.frame could work, I have no idea. Other formats such as parquet or similar might be usable, I'm not entirely sure (nor do I have experience with them to say more than this).
Since I work with large RasterBrick objects, they can't be held in memory and are instead stored as temporary files in the current temporary directory tempdir(), or, to be exact, in its "raster" subfolder.
Due to the large file sizes it would be very nice to delete the temporary files of the unused objects.
If I delete objects I no longer need by
rm(list=ls(pattern="xxx"))
the temporary files still exist.
To my understanding, garbage collection with gc() will have no effect on that, since it does not touch files on the hard drive.
The automatically given names of the temporary files don't show any relation to the object names.
Therefore it is not possible to delete them by a code like
raster_temp_dir <- paste(tempdir(), "/raster", sep="")
files_to_be_removed <- list.files(raster_temp_dir, pattern="xxx", full.names=TRUE)
Unfortunately the files of objects still in use aren't read-only.
Therefore I would also delete files of objects I still need by running:
files_to_be_removed <- list.files(raster_temp_dir, full.names=TRUE)
Has somebody already solved this problem, or does anyone have an idea how to solve it?
It would be perfect if the code could somehow distinguish between used and unused objects.
Since this is unlikely to be implemented, a workaround could be naming the temporary files of the raster objects manually, but I haven't found an option for this either, since the filename argument can only be used when explicitly writing files to the hard disk, not when temporary files are created (to my knowledge).
Thanks!
I think the function you're looking for is file.remove(). Just pass it a vector with the file names you want to delete.
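For example, a rough sketch (assuming the default raster temporary location, a "raster" subfolder of tempdir(); filter the vector first so you only drop files belonging to objects you no longer use):
# raster's temporary grids live in a "raster" subfolder of tempdir(),
# typically as .grd/.gri file pairs
raster_temp_dir <- file.path(tempdir(), "raster")

# list the temporary files -- subset this vector to the ones you want gone
files_to_be_removed <- list.files(raster_temp_dir, full.names = TRUE)

file.remove(files_to_be_removed)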
What my question isn't:
Efficient way to maintain a h2o data frame
H2O running slower than data.table R
Loading data bigger than the memory size in h2o
Hardware/Space:
32 Xeon threads w/ ~256 GB Ram
~65 GB of data to upload. (about 5.6 billion cells)
Problem:
It is taking hours to upload my data into h2o. This isn't any special processing, only "as.h2o(...)".
It takes less than a minute using "fread" to get the text into the workspace, and then I make a few row/col transformations (diffs, lags) and try to import.
The total R memory use is ~56 GB before trying any sort of "as.h2o", so the 128 GB allocated shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
bumping ram up to 128 GB in 'h2o.init'
using slam, data.table, and options( ...
convert to "as.data.frame" before "as.h2o"
write to csv file (R's write.csv chokes and takes forever. It is writing a lot of GB though, so I understand).
write to sqlite3, too many columns for a table, which is weird.
Checked drive cache/swap to make sure there are enough GB there. Perhaps java is using cache. (still working)
Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" for it. I'm up to 15GB written.
Update2:
It is a hideous csv file, at ~22GB (~2.4Mrows, ~2300 cols). For what it was worth, it took from 12:53pm until 2:44PM to write the csv file. Importing it was substantially faster, after it was written.
Think of as.h2o() as a convenience function, that does these steps:
converts your R data to a data.frame, if not already one.
saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
calls h2o.uploadFile() on that temp file
deletes the temp file
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The deciding factor between the two is visibility:
With h2o.uploadFile() your client has to be able to see the file.
With h2o.importFile() your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.
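Putting those tips together, a rough sketch (the paths and data.table names are placeholders, and this assumes your R client runs on one of the cluster nodes so the files are visible to both client and cluster):
library(data.table)
library(h2o)

h2o.init()

# write the data to disk once with the fast fwrite(), instead of going through as.h2o()
fwrite(dt_needed_in_r, "/shared/data/main_cols.csv")    # e.g. the ~100 columns you process in R
fwrite(dt_rest,        "/shared/data/other_cols.csv")   # the remaining ~2200 columns

# multi-threaded import on the cluster side
main_hex  <- h2o.importFile("/shared/data/main_cols.csv")
other_hex <- h2o.importFile("/shared/data/other_cols.csv")

# column-bind the two frames inside H2O
full_hex <- h2o.cbind(main_hex, other_hex)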
My question is: how can I save the output, i.e. mydata,
mydata = array(sample(100), dim = c(2, 100, 4000))
reasonably fast?
I used the reshape2 package as suggested here.
melt(mydata)
and
write.table(mydata,file="data_1")
But it is taking more than one hour to save the data into the file. I am looking for any other faster ways to do the job.
I strongly suggest referring to this great post, which helps make the issues around file saving clear.
Anyway, saveRDS could be the most adequate choice for you. The most relevant difference, in this case, is that save can write many objects to a file in a single call, whilst saveRDS, being a lower-level function, works with a single object at a time.
save and load allow you to save a named R object to a file or other connection and restore that object again. But, when loaded, the named object is restored to the current environment with the same name it had when saved.
saveRDS and readRDS, instead, allow you to save a single R object to a connection (typically a file) and to restore the object, possibly under a different name. This lower-level behaviour probably makes the RDS functions more efficient for your case.
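To make the naming difference concrete, a small illustration using the mydata array from the question:
mydata <- array(sample(100), dim = c(2, 100, 4000))

# save()/load(): the object comes back under its original name
save(mydata, file = "mydata.RData")
load("mydata.RData")            # restores an object called 'mydata'

# saveRDS()/readRDS(): you choose the name when restoring
saveRDS(mydata, file = "mydata.rds")
restored <- readRDS("mydata.rds")
identical(mydata, restored)     # TRUE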
Read the help text for saveRDS using ?saveRDS. This will probably be the best way for you to save and load large dataframes.
saveRDS(yourdata, file = "yourdata.rds")
I use parSapply() from the parallel package in R. I need to perform calculations on a huge amount of data. Even in parallel it takes hours to execute, so I decided to regularly write results to a file from the workers using write.table(), because the process crashes from time to time when running out of memory or for some other random reason, and I want to continue the calculations from the place where they stopped. I noticed that some lines of the csv files I get are cut in the middle, probably as a result of several processes writing to the file at the same time. Is there a way to place a lock on the file for the time while write.table() executes, so other workers can't access it, or is the only way out to write to a separate file from each worker and then merge the results?
It is now possible to create file locks using filelock (GitHub)
In order to facilitate this with parSapply() you would need to edit your loop so that if the file is locked the process will not simply quit, but either try again or Sys.sleep() for a short amount of time. However, I am not certain how this will affect your performance.
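A sketch of what that retry logic could look like inside each worker (file names are placeholders; filelock::lock() waits up to the given timeout in milliseconds and returns NULL if the lock could not be obtained):
library(filelock)

write_with_lock <- function(result_row, out_file = "results.csv",
                            lock_file = "results.csv.lock") {
  repeat {
    # try to acquire the lock, waiting up to 5 seconds
    lck <- lock(lock_file, exclusive = TRUE, timeout = 5000)
    if (!is.null(lck)) break
    Sys.sleep(runif(1, 0, 1))   # back off briefly before trying again
  }
  on.exit(unlock(lck))
  write.table(result_row, file = out_file, append = TRUE,
              sep = ",", row.names = FALSE, col.names = FALSE)
}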
Instead I recommend you create cluster-specific files that can hold your data, eliminating the need for a lock file and not reducing your performance. Afterwards you should be able to weave these files and create your final results file.
If size is an issue then you can use disk.frame to work with files that are larger than your system RAM.
The old unix technique looks like this:
# Make sure other processes are not writing to the file by trying to create a
# lock directory: mkdir fails if the directory already exists, so keep trying
# and exit the repeat loop once the lock directory is successfully created.
repeat {
  if (system2(command = "mkdir", args = "lockdir", stderr = NULL) == 0) { break }
}

write.table(MyTable, file = filename, append = TRUE)

# get rid of the locking directory to release the lock
system2(command = "rmdir", args = "lockdir")