Are locks necessary for writing with fwrite from parallel processes in R? - r

I have an intensive simulation task that is run in parallel on a high-performance cluster.
Each of the ~3000 threads uses an R script to write its simulation output with the fwrite function of the data.table package.
Our IT guy told me to use locks, so I use the flock package to lock the file while the threads are writing to it.
But this created a new bottleneck: most of the time the processes just wait until they can write. How can I evaluate whether the lock is really necessary? It seems very odd to me that more than 90% of the total processing time is spent waiting for the lock.
Can anyone tell me if it really is necessary to use locks when I only append results to a csv with the fwrite function and the argument append = T?
Edit:
I already tried writing individual files and merging them in various ways after all jobs were completed, but merging also took too long to be acceptable.
Writing all simulation results to one file without a lock still seems to be the best approach. It is very fast, and I did not find errors when doing it without the lock for a smaller number of simulations.
Could writing without a lock cause problems that go unnoticed once I run millions of simulations?

(I started writing a few comments to this effect, then decided to wrap them up in an answer. This isn't a perfect step-by-step solution, but your situation is not so simple, and quick-fixes are likely to have unintended side-effects in the long-term.)
I completely agree that relying on file-locking is not a good path. Even if the shared filesystem[1] supports them "fully" (many claim it but with caveats and/or corner-cases), they almost always have some form of performance penalty. Since the only time you need the data all together is at data harvesting (not mid-processing), the simplest approach in my mind is to write to individual files.
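For instance, each worker could write its results under its own sub-directory, so no two processes ever touch the same file. A minimal sketch, where results is one worker's output and task_id is any identifier that is unique per worker (array index, PID, ...); both names are placeholders:
library(data.table)
# One output file per worker; no lock needed because nothing is shared.
out_dir <- file.path("mypath", sprintf("task_%s", task_id))
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
fwrite(results, file.path(out_dir, "out.csv"))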
When the whole processing is complete, either (a) combine all files into one (simple bash scripts) and bulk-insert into a database; (b) combine into several big files (again, bash scripts) that are small enough to be read into R; or (c) file-by-file insert into the database.
Combine all files into one large file. Using bash, this might be as simple as
find mypath -name out.csv -print0 | xargs -0 cat > onebigfile.csv
where mypath is the directory under which all of your files are contained, and each process creates its own out.csv file within a unique sub-directory. This is not a perfect assumption, but the premise is that if each process creates a file, you should be able to uniquely identify those output files from all other files/directories under the path. From there, find ... -print0 | xargs -0 cat > onebigfile.csv is, I believe, the best way to combine them all.
From here, I think you have three options:
Insert into a server-based database (postgresql, sql server, mariadb, etc) using the best bulk-insert tool available for that DBMS. This is a whole new discussion (outside the scope of this Q/A), but it can be done "formally" (with a working company database) or "less-formally" using a docker-based database for your project use. Again, docker-based databases can be an interesting and lengthy discussion.
Insert into a file-based database (sqlite, duckdb). Both of those options allege supporting file sizes well over what you would require for this data, and they both give you the option of querying subsets of the data as needed from R. If you don't know the DBI package or DBI way of doing things, I strongly suggest starting at https://dbi.r-dbi.org/ and https://db.rstudio.com/.
Split the file and then read it piece-wise into R. I don't know if you can fit the entire data into R, but if you can, and the act of reading it in is the hurdle, then
split --lines=1000000 onebigfile.csv smallerfiles.csv.
HDR=$(head -n 1 onebigfile.csv)
sed -i -e "1i ${HDR}" smallerfiles.csv.*
sed -i -e "1d" smallerfiles.csv.aa
where 1000000 is the number of rows you want in each smaller file. You will find n files named smallerfiles.csv.aa, *.ab, *.ac, etc. (depending on the size, the suffix may grow to three or more letters).
The HDR= and first sed prepends the header row into all smaller files; since the first smaller file already has it, the second sed removes the duplicate first row.
Read each file individually into R or into the database. To bring them into R, this could be done with something like:
files <- list.files("mypath", pattern = "^out.csv$", recursive = TRUE, full.names = TRUE)
library(data.table)
alldata <- rbindlist(lapply(files, fread))
assuming that R can hold all of the data at one time. If R cannot (either doing it this way or just reading onebigfile.csv above), then you really have no other options than a form of database[2].
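For the file-based database option (duckdb via DBI), a minimal sketch, assuming the duckdb and DBI packages are installed; the database file sims.duckdb and the table name sims are illustrative:
library(DBI)
# Load the combined CSV into a file-based DuckDB database once ...
con <- dbConnect(duckdb::duckdb(), dbdir = "sims.duckdb")
dbExecute(con, "CREATE TABLE sims AS SELECT * FROM read_csv_auto('onebigfile.csv')")
# ... then query only the subsets you need from R.
head_of_data <- dbGetQuery(con, "SELECT * FROM sims LIMIT 10")
dbDisconnect(con, shutdown = TRUE)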
To read them individually into the DBMS, you could likely do it in bash (well, any shell, just not R) and it would be faster than R. For that matter, though, you might as well combine into onebigfile.csv and do the command-line insert once. One advantage, however, of inserting individual files into the database is that, given a reasonably-simple bash script, you could read the data in from completed threads while other threads are still working; this provides mid-processing status cues and, if the run-time is quite long, might give you the ability to do some work before the processing is complete.
Notes:
"Shared filesystem": I'm assuming that these are not operating on a local-only filesystem. While certainly not impossible, most enterprise high-performance systems I've dealt with are based on some form of shared filesystem, whether it be NFS or GPFS or similar.
"Form of database": technically, there are on-disk file formats that support partial reads in R. While vroom:: can allegedly do memory-mapped partial reads, I suspect you might run into problems later as it may eventually try to read more than memory will support. Perhaps disk.frame could work, I have no idea. Other formats such as parquet or similar might be usable, I'm not entirely sure (nor do I have experience with them to say more than this).

Related

Partially read really large csv.gz in R using vroom

I have a csv.gz file that (from what I've been told) was 70GB in size before compression. My machine has 50GB of RAM, so I will never be able to open it as a whole in R.
I can load for example the first 10m rows as follows:
library(vroom)
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7)
For what I have to do, it is fine to load 10m rows at the time, do my operations, and continue with the next 10m rows. I could do this in a loop.
I was therefore trying the skip argument.
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7, skip = 10^7)
This results in an error:
Error: The size of the connection buffer (131072) was not large enough
to fit a complete line:
* Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`
I increased this with Sys.setenv("VROOM_CONNECTION_SIZE" = 131072*1000), however, the error persists.
Is there a solution to this?
Edit: I found out that random access into a gzip-compressed csv (csv.gz) is not possible; we have to start from the top. Probably the easiest fix is to decompress and save the file; then skip should work.
I haven't been able to figure out a vroom solution for very large, more-than-RAM (gzipped) csv files. However, the following approach has worked well for me, and I'd be grateful to know about approaches with better querying speed that also save disk space.
Use the split sub-command in xsv from https://github.com/BurntSushi/xsv to split the large csv file into comfortably-within-RAM chunks of, say, 10^5 lines, and save them in a folder.
Read the chunks one by one with data.table::fread in a for loop (to avoid a low-memory error) and save each of them into a folder as compressed parquet files using the arrow package, which saves space and prepares the large table for fast querying. For even faster operations, it is advisable to re-save the parquet files partitioned by the fields you need to filter on most frequently.
Now you can use arrow::open_dataset and query that multi-file parquet folder using dplyr commands. It takes minimum disk space and gives the fastest results in my experience.
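As a rough sketch of that last step, assuming the parquet chunks live in path_to_parquet_folder and that some_column / another_column are placeholder column names:
library(arrow)
library(dplyr)
ds <- open_dataset("path_to_parquet_folder")
result <- ds %>%
  filter(some_column == "some_value") %>%   # filtering happens on the parquet files
  select(some_column, another_column) %>%
  collect()                                 # only the filtered subset is pulled into RAM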
I use data.table::fread with explicitly defined column classes for each field for the fastest and most reliable parsing of csv files. readr::read_csv has also been accurate, but slower. However, read_csv's auto-assignment of column classes, as well as the ways in which you can custom-define column classes, is actually the best: less human time but more machine time, which means it may be faster overall depending on the scenario. Other csv parsers have thrown errors on the kind of csv files I work with and wasted time.
You may now delete the folder containing chunked csv files to save space, unless you want to experiment loop-reading them with other csv parsers.
Other previously successful approaches: loop-read all csv chunks as mentioned above and save them into:
a folder using the disk.frame package. That folder may then be queried with the dplyr or data.table commands explained in its documentation. It can save compressed fst files, which saves space, though not as much as parquet files.
a table in a DuckDB database, which allows querying with SQL or dplyr commands. The database-table approach won't save you disk space, but DuckDB also allows querying partitioned/un-partitioned parquet files (which does save disk space) with SQL commands.
EDIT: - Improved Method Below
I experimented a little and found a much better way to do the above operations. Using the code below, the large (compressed) csv file will be chunked automatically within R environment (no need to use any external tool like xsv) and all chunks will be written in parquet format in a folder ready for querying.
library(readr)
library(arrow)
fyl <- "...path_to_big_data_file.csv.gz"
pqFolder <- "...path_to_folder_where_chunked_parquet_files_are_to_be_saved"
f <- function(x, pos) {
  write_parquet(x,
                file.path(pqFolder, paste0(pos, ".parquet")),
                compression = "gzip",
                compression_level = 9)
}

read_csv_chunked(
  fyl,
  col_types = list(Column1 = "f", Column2 = "c", Column3 = "T", ...),  # all column specifications
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 10^6)
If, instead of parquet, you want to use:
disk.frame, the callback function may be used to create chunked compressed fst files for dplyr or data.table style querying.
DuckDB, the callback function may be used to append the chunks into a database table for SQL or dplyr style querying.
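For the DuckDB variant, the callback might look roughly like this; this is a sketch assuming the DBI and duckdb packages are installed, and big_data.duckdb / big_table are placeholder names (supply your real col_types as above):
library(readr)
library(DBI)
con <- dbConnect(duckdb::duckdb(), dbdir = "big_data.duckdb")
# Append each chunk to a table; the first chunk creates the table.
append_chunk <- function(x, pos) {
  x <- as.data.frame(x)
  if (dbExistsTable(con, "big_table")) {
    dbWriteTable(con, "big_table", x, append = TRUE)
  } else {
    dbWriteTable(con, "big_table", x)
  }
}
read_csv_chunked(
  fyl,
  col_types = cols(),   # replace with your real column specification
  callback = SideEffectChunkCallback$new(append_chunk),
  chunk_size = 10^6)
dbDisconnect(con, shutdown = TRUE)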
By judiciously choosing the chunk_size parameter of readr::read_csv_chunked command, the computer should never run out of RAM while running queries.
PS: I use gzip compression for parquet files since they can then be previewed with ParquetViewer from https://github.com/mukunku/ParquetViewer. Otherwise, zstd (not currently supported by ParquetViewer) decompresses faster and hence improves reading speed.
EDIT 2:
I got a csv file which was really big for my machine: 20 GB gzipped, expanding to about 83 GB, whereas my home laptop has only 16 GB. It turns out that the read_csv_chunked method I mentioned in the earlier EDIT fails to complete: it always stops working after some time and does not create all the parquet chunks. Using my previous method of splitting the csv file with xsv and then looping over the pieces to create parquet chunks worked. To be fair, I must mention it took multiple attempts this way too, and I had programmed a check to create only the missing parquet chunks when re-running the program on successive attempts.
EDIT 3:
VROOM does have difficulty when dealing with huge files since it needs to store the index in memory as well as any data you read from the file. See development thread https://github.com/r-lib/vroom/issues/203
EDIT 4:
Additional tip: The chunked parquet files created by the above-mentioned method may be very conveniently queried using SQL with the DuckDB methods described at
https://duckdb.org/docs/data/parquet
and
https://duckdb.org/2021/06/25/querying-parquet.html
The DuckDB method is significant because the R arrow method currently suffers from a serious limitation, which is mentioned in the official documentation page https://arrow.apache.org/docs/r/articles/dataset.html.
Specifically, and I quote: "In the current release, arrow supports the dplyr verbs mutate(), transmute(), select(), rename(), relocate(), filter(), and arrange(). Aggregation is not yet supported, so before you call summarise() or other verbs with aggregate functions, use collect() to pull the selected subset of the data into an in-memory R data frame."
The problem is that if you use collect() on a very big dataset, RAM usage spikes and the system crashes, whereas using SQL statements in DuckDB to do the same aggregation job on the same big dataset does not cause a RAM spike or a system crash. So until arrow supports aggregation queries on big data, SQL via DuckDB provides a nice solution for querying big datasets in chunked parquet format.
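A minimal sketch of such an aggregation, assuming the parquet chunks sit in path_to_parquet_folder and that some_group / some_value are placeholder column names:
library(DBI)
con <- dbConnect(duckdb::duckdb())
agg <- dbGetQuery(con, "
  SELECT some_group, COUNT(*) AS n, AVG(some_value) AS mean_value
  FROM parquet_scan('path_to_parquet_folder/*.parquet')
  GROUP BY some_group")
dbDisconnect(con, shutdown = TRUE)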

Can data.table's fread accept connections?

I have an executable that I can call using the system() command. This executable will print some data which I can pipe into R using:
read.csv(pipe(command))
fread has amazing performance, which I would like to take advantage of when bringing the data in, but I cannot use fread(pipe(command)). The alternative is to run the executable and dump its output to a file first, then read it in using fread. Doing so requires writing intermediate data to disk and adds overhead by introducing that intermediate step. Is there a way to wrap or use fread with my executable?
fread can't take connections for now; the feature was requested in 2015: https://github.com/Rdatatable/data.table/issues/561
Even though Maksim's comment would be valid, it would not work on a Windows machine, which in some cases can be troublesome.
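One workaround worth testing (not mentioned in the answers above) is the cmd argument that newer versions of data.table's fread provide: it runs a shell command and parses its standard output without an explicit temporary file. A minimal sketch, where mycmd --args stands in for your executable:
library(data.table)
# Hypothetical command; fread runs it and reads the CSV it prints to stdout.
dt <- fread(cmd = "mycmd --args")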

Lock file when writing to it from parallel processes in R

I use parSapply() from the parallel package in R. I need to perform calculations on a huge amount of data. Even in parallel it takes hours to execute, so I decided to regularly write results to a file from the clusters using write.table(), because the process crashes from time to time, when running out of memory or for some other random reason, and I want to continue the calculations from the place where they stopped. I noticed that some lines of the csv files I get are cut in the middle, probably as a result of several processes writing to the file at the same time. Is there a way to place a lock on the file while write.table() executes, so other clusters can't access it, or is the only way out to write to a separate file from each cluster and then merge the results?
It is now possible to create file locks using filelock (GitHub)
In order to facilitate this with parSapply() you would need to edit your loop so that if the file is locked the process does not simply quit, but either tries again or Sys.sleep()s for a short amount of time (a sketch of such a retry loop follows). However, I am not certain how this will affect your performance.
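A minimal sketch of such a retry loop, assuming the filelock package; results, results.csv and the 5-second timeout (given in milliseconds) are placeholders:
library(filelock)
repeat {
  lck <- lock("results.csv.lock", timeout = 5000)  # NULL is returned on timeout
  if (!is.null(lck)) break
  Sys.sleep(runif(1, 0.1, 1))                      # back off briefly, then retry
}
write.table(results, file = "results.csv", append = TRUE,
            sep = ",", col.names = FALSE, row.names = FALSE)
unlock(lck)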
Instead I recommend you create cluster-specific files that can hold your data, eliminating the need for a lock file and not reducing your performance. Afterwards you should be able to weave these files and create your final results file.
If size is an issue then you can use disk.frame to work with files that are larger than your system RAM.
The old unix technique looks like this:
# Make sure other processes are not writing to the file by trying to create a
# directory: if it already exists, mkdir returns an error and we try again.
# Exit the repeat loop once the lock directory has been created successfully.
repeat {
  if (system2(command = "mkdir", args = "lockdir", stderr = NULL) == 0) break
}
write.table(MyTable, file = filename, append = TRUE)
# Get rid of the locking directory to release the lock.
system2(command = "rmdir", args = "lockdir")

Open large files with R

I want to process a file (1.9 GB) in R that contains 100,000,000 datasets.
Actually I only want to have every 1000th dataset.
Each dataset contains 3 columns, separated by a tab.
I tried: data <- read.delim("file.txt"), but R was not able to manage all datasets at once.
Can I tell R directly to load only every 1000th dataset from the file?
After reading the file I want to bin the data of column 2.
Is it possible to directly bin the number written in column 2?
Is it possible to read the file line by line, without loading the whole file into memory?
Thanks for your help.
Sven
You should pre-process the file using another tool before reading into R.
To write every 1000th line to a new file, you can use sed, like this:
sed -n '0~1000p' infile > outfile
Then read the new file into R:
datasets <- read.table("outfile", sep = "\t", header = F)
You may want to look at the manual devoted to R Data Import/Export.
Naive approaches always load all the data. You don't want that. You may want another script which reads line-by-line (written in awk, perl, python, C, ...) and emits only every N-th line. You can then read the output from that program directly in R via a pipe -- see the help on Connections.
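For example (a sketch, assuming awk is available), an awk one-liner can emit every 1000th line and feed it straight to R through pipe():
# Read only every 1000th line of file.txt without loading the whole file.
con <- pipe("awk 'NR % 1000 == 0' file.txt")
datasets <- read.table(con, sep = "\t", header = FALSE)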
In general, very large memory setups require some understanding of R. Be patient, you will get this to work but once again, a naive approach requires lots of RAM and a 64-bit operating system.
Maybe the colbycol package could be useful to you.

Is there a way to read and write in-memory files in R?

I am trying to use R to analyze large DNA sequence files (fastq files, several gigabytes each), but the standard R interface to these files (ShortRead) has to read the entire file at once. This doesn't fit in memory, so it causes an error. Is there any way that I can read a few (thousand) lines at a time, stuff them into an in-memory file, and then use ShortRead to read from that in-memory file?
I'm looking for something like Perl's IO::Scalar, for R.
I don’t know much about R, but have you had a look at the mmap package?
It looks like ShortRead is soon to add a "FastqStreamer" class that does what I want.
Well, I don't know about readFastq accepting something other than a file...
But if it can, for other functions, you can use the R function pipe() to open a unix connection, then you could do this with a combination of unix commands head and tail and some pipes.
For example, to get lines 91 to 100, you use this:
head file.txt -n 100 | tail -n 10
So you can just read the file in chunks.
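In R that might look like the following sketch, where file.txt and the line range are placeholders:
# Lines 91-100 of file.txt, read through a shell pipeline without a temp file.
chunk <- read.csv(pipe("head -n 100 file.txt | tail -n 10"), header = FALSE)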
If you have to, you can always use these unix utilities to create a temporary file, then read that in with ShortRead. It's a pain, but if it can only take a file, at least it works.
Incidentally, the general answer to how to create an in-memory file in R (like Perl's IO::Scalar) is the textConnection function. Sadly, though, the ShortRead package cannot handle textConnection objects as inputs, so while the idea I expressed in the question, reading a file in small chunks into in-memory files which are then parsed bit by bit, is certainly possible for many applications, it does not work for my particular application, since ShortRead does not like textConnections. So the solution is the FastqStreamer class described above.
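For completeness, a minimal sketch of the streaming pattern FastqStreamer enables (file name and chunk size are placeholders):
library(ShortRead)
strm <- FastqStreamer("reads.fastq.gz", n = 1e6)  # 1e6 reads per chunk
while (length(fq <- yield(strm))) {
  # process this chunk of reads, e.g. quality-filter or count
}
close(strm)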

Resources