I'm putting the finishing touches on a project and have a bit of a dilemma. Once all the data is gathered and the statistics are calculated, the results are printed to the screen. However, in the program, the user is given the option of saving all the output to a file. I'd like to print the data to both the terminal and the file with the same formatting.
I considered doing a fork(), but this is all one process and the data output happens just before program termination. If I fork, the child process would start executing from the beginning, and implementing this successfully would mean a not-so-minor rewrite of 500+ LOC.
I covered roughly this exact same topic last semester, but left my unix programming book at home and none of the examples I've found fit my needs.
Consider piping your output through the tee command, which writes to both stdout and a file, e.g. `./yourprogram | tee results.txt`.
I have an intensive simulation task that is run in parallel on a high-performance cluster.
Each thread (~3000 of them) uses an R script to write the simulation output with the fwrite function of the data.table package.
Our IT guy told me to use locks, so I use the flock package to lock the file while all threads are writing to it.
But this created a new bottleneck: most of the time the processes wait until they can write. Now I am wondering how I can evaluate whether the lock is really necessary. It just seems very weird to me that more than 90% of the processing time for all jobs is spent waiting for the lock.
Can anyone tell me if it really is necessary to use locks when I only append results to a csv with the fwrite function and the argument append = T?
Edit:
I already tried writing individual files and merging them in various ways after all jobs were completed, but merging also took too long to be acceptable.
It still seems to be the best way to just write all simulation results to one file without lock. This works very fast and I did not find errors when doing it without the lock for a smaller number of simulations.
Could writing without lock cause some problems that will be unnoticed after running millions of simulations?
(I started writing a few comments to this effect, then decided to wrap them up in an answer. This isn't a perfect step-by-step solution, but your situation is not so simple, and quick-fixes are likely to have unintended side-effects in the long-term.)
I completely agree that relying on file-locking is not a good path. Even if the shared filesystem[1] supports locks "fully" (many claim to, but with caveats and/or corner cases), they almost always come with some form of performance penalty. Since the only time you need the data all together is at data harvesting (not mid-processing), the simplest approach in my mind is to write to individual files.
When the whole processing is complete, either (a) combine all files into one (simple bash scripts) and bulk-insert into a database; (b) combine into several big files (again, bash scripts) that are small enough to be read into R; or (c) file-by-file insert into the database.
Combine all files into one large file. Using bash, this might be as simple as
find mypath -name out.csv -print0 | xargs -0 cat > onebigfile.csv
Where mypath is the directory under which all of your files are contained, and each process is creating its own out.csv file within a unique sub-directory. This is not a perfect assumption, but the premise is that if each process creates a file, you should be able to uniquely identify those output files from all other files/directories under the path. From there, the find ... -print0 | xargs -0 cat > onebigfile.csv is I believe the best way to combine them all.
From here, I think you have three options:
Insert into a server-based database (postgresql, sql server, mariadb, etc) using the best bulk-insert tool available for that DBMS. This is a whole new discussion (outside the scope of this Q/A), but it can be done "formally" (with a working company database) or "less-formally" using a docker-based database for your project use. Again, docker-based databases can be an interesting and lengthy discussion.
Insert into a file-based database (sqlite, duckdb). Both of those options allege supporting file sizes well over what you would require for this data, and they both give you the option of querying subsets of the data as needed from R. If you don't know the DBI package or DBI way of doing things, I strongly suggest starting at https://dbi.r-dbi.org/ and https://db.rstudio.com/.
Split the file and then read it piece-wise into R. I don't know if you can fit the entire data set into R, but if you can and the act of reading it in is the hurdle, then
split --lines=1000000 onebigfile.csv smallerfiles.csv.
HDR=$(head -n 1 onebigfile.csv)
sed -i -e "1i ${HDR}" smallerfiles.csv.*
sed -i -e "1d" smallerfiles.csv.aa
where 1000000 is the number of rows you want in each smaller file. You will find n files named smallerfiles.csv.aa, *.ab, *.ac, etc. (depending on the size, perhaps you'll see three or more letters).
The HDR= and first sed prepends the header row into all smaller files; since the first smaller file already has it, the second sed removes the duplicate first row.
Read each file individually into R or into the database. To bring into R, this would be done with something like:
files <- list.files("mypath", pattern = "^out.csv$", recursive = TRUE, full.names = TRUE)
library(data.table)
alldata <- rbindlist(lapply(files, fread))
assuming that R can hold all of the data at one time. If R cannot (either doing it this way or just reading onebigfile.csv above), then you really have no other options than a form of database[2].
To read them individually into the DBMS, you could likely do it in bash (well, any shell, just not R) and it would be faster than R. For that matter, though, you might as well combine into onebigfile.csv and do the command-line insert once. One advantage, however, of inserting individual files into the database is that, given a reasonably-simple bash script, you could read the data in from completed threads while other threads are still working; this provides mid-processing status cues and, if the run-time is quite long, might give you the ability to do some work before the processing is complete.
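As a concrete sketch of option 2 above (the file-based database), assuming the DBI and RSQLite packages are installed; the stand-in CSV contents and the sim_id column name are made up for illustration:

```r
library(DBI)

# Stand-in for onebigfile.csv from the combining step; in practice the
# file already exists and this line is not needed.
write.csv(data.frame(sim_id = 1:5, value = runif(5)),
          "onebigfile.csv", row.names = FALSE)

con <- dbConnect(RSQLite::SQLite(), "results.sqlite")

# Bulk-load the combined CSV into a table
dbWriteTable(con, "results", read.csv("onebigfile.csv"), overwrite = TRUE)

# Query only the subset you need instead of reading everything into R
first_rows <- dbGetQuery(con, "SELECT * FROM results WHERE sim_id <= 3")
dbDisconnect(con)
```

From here the data lives on disk in one file, and each R session pulls in only the rows it asks for.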
Notes:
"Shared filesystem": I'm assuming that these are not operating on a local-only filesystem. While certainly not impossible, most enterprise high-performance systems I've dealt with are based on some form of shared filesystem, whether it be NFS or GPFS or similar.
"Form of database": technically, there are on-disk file formats that support partial reads in R. While vroom:: can allegedly do memory-mapped partial reads, I suspect you might run into problems later as it may eventually try to read more than memory will support. Perhaps disk.frame could work, I have no idea. Other formats such as parquet or similar might be usable, I'm not entirely sure (nor do I have experience with them to say more than this).
I have multiple global climate model (GCM) data files. I successfully cropped one file, but it takes too much time to crop over a thousand data files one by one manually. How can I do multiple crops? Thanks
What you need is a loop combined with some patience... ;-)
for file in $(ls *.nc) ; do
cdo sellonlatbox,lon1,lon2,lat1,lat2 $file ${file%???}_crop.nc
done
The %??? chops off the ".nc", and then I add "_crop" to the output file name...
I know I am a little late to add to the answer, but I still wanted to add my knowledge for those who pass through this question.
I followed the code given by Adrian Tompkins and it seems to work exceptionally well. But there are some things to be considered which I'd like to highlight. Because I am still a novice at programming, please forgive my very limited answer. Here are my findings for the code above to work...
The code calls CDO (Climate Data Operators), a non-GUI standalone program that can be used from a Linux terminal. In my case, I used it on Windows 10 through WSL (Ubuntu 20.04 LTS). There are good videos on YouTube about using WSL.
The code initially did not work for me until I made a slight change. The code
for file in $(ls *.nc) ; do
cdo sellonlatbox,lon1,lon2,lat1,lat2 $file ${file%???}_crop.nc
done
worked for me when I wrote it as
for file in $(ls *.nc) ; do
cdo sellonlatbox,lon1,lon2,lat1,lat2 $file ${file%???}_crop.nc;
done
Note the presence of a ; at the end of line 2 of the loop.
The entire code (in my case) was put in a text file (it can be saved as a script in other formats as well) and passed as a script to the Linux terminal. The procedure to execute the script file is as follows:
3.a) create the '.txt' file containing the script above. Do note that the directory should be checked in all steps.
3.b) make the file executable by running the command line
chmod +x <name_of_the_textfile_with_extension> in the terminal.
3.c) run the script (in my case it is the textfile) by running the command line ./<name_of_the_textfile_with_extension>
The above procedure will give you cropped NetCDF files for the corresponding NetCDF files in the same folder.
cheers!
I use parSapply() from the parallel package in R. I need to perform calculations on a huge amount of data, and even in parallel it takes hours to execute, so I decided to regularly write results to a file from the clusters using write.table(), because the process crashes from time to time when running out of memory or for some other random reason, and I want to continue the calculations from the place where they stopped. I noticed that some lines of the csv files I get are simply cut in the middle, probably as a result of several processes writing to the file at the same time. Is there a way to place a lock on the file while write.table() executes, so other clusters can't access it, or is the only way out to write to a separate file from each cluster and then merge the results?
It is now possible to create file locks using filelock (GitHub)
In order to facilitate this with parSapply() you would need to edit your loop so that if the file is locked the process will not simply quit, but either try again or Sys.sleep() for a short amount of time. However, I am not certain how this will affect your performance.
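A minimal sketch of such a retry loop, assuming the filelock package is installed; the file and lock-file names are made up for illustration:

```r
library(filelock)

# With timeout = 0, lock() returns NULL immediately instead of blocking
# when another process holds the lock, so we can back off and retry.
write_with_lock <- function(row, path = "results.csv") {
  repeat {
    lck <- lock(paste0(path, ".lck"), exclusive = TRUE, timeout = 0)
    if (!is.null(lck)) break
    Sys.sleep(0.1)  # back off briefly before trying again
  }
  write.table(row, file = path, append = TRUE, sep = ",",
              col.names = FALSE, row.names = FALSE)
  unlock(lck)
}

write_with_lock(data.frame(sim = 1, result = 0.42))
```

Note that every worker still serializes on the same lock, which is exactly the bottleneck described above.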
Instead I recommend you create cluster-specific files that can hold your data, eliminating the need for a lock file and not reducing your performance. Afterwards you should be able to weave these files and create your final results file.
If size is an issue then you can use disk.frame to work with files that are larger than your system RAM.
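A sketch of the cluster-specific-file approach; using the worker's process ID in the file name is one assumed way to guarantee uniqueness:

```r
library(data.table)

my_results <- data.table(sim = 1:3, value = runif(3))  # stand-in output

# Inside each worker: write to a file whose name is unique to the
# process, so no two workers ever touch the same file (no lock needed).
worker_file <- sprintf("results_%d.csv", Sys.getpid())
fwrite(my_results, worker_file)

# After all jobs finish: weave the per-worker files into one table.
files <- list.files(pattern = "^results_[0-9]+\\.csv$")
all_results <- rbindlist(lapply(files, fread))
```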
The old unix technique looks like this:
# Make sure other processes are not writing to the file by trying to
# create a lock directory: if it already exists, mkdir fails and we try
# again. Exit the repeat loop once the directory is successfully created.
repeat {
  if (system2(command = "mkdir", args = "lockdir", stderr = FALSE) == 0) break
}
write.table(MyTable, file = filename, append = TRUE)
# Get rid of the locking directory to release the lock
system2(command = "rmdir", args = "lockdir")
Lets say I want to use sink for writing to a file in R.
sink("outfile.txt")
cat("Line one.\n")
cat("Line two.")
sink()
question 1. I have seen people writing sink() at the end, why do we need this? Can something go wrong when we do not have this?
question 2. What is the best way to write many lines one by one to a file with a for-loop, where you also need to format each line? That is, I might need a different number in each line; in Python I would use something like outfile.write("Line with number %.3f" % 1.231).
Question 1:
The sink function redirects all text going to the stdout stream to the file handler you give to sink. This means that anything that would normally print out into your interactive R session, will now instead be written to the file in sink, in this case "outfile.txt".
When you call sink again without any arguments, you are telling it to resume using stdout instead of "outfile.txt". So no, nothing will go wrong if you don't call sink() at the end, but you need it if you want to start seeing output again in your R session.
As @Roman has pointed out though, it is better to explicitly tell cat to output to the file. That way you get only what you want and expect to see in the file, while still getting the rest of the output in the R session.
Question 2:
This also answers question two. R (as far as I am aware) does not have direct file handles like Python's. Instead you can just use cat(..., file = "outfile.txt", append = TRUE) inside a for loop.
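For example, the formatted loop from question 2 might look like this (the values and file name are just illustrations):

```r
# Formatted line-by-line writing, analogous to Python's
# outfile.write("Line with number %.3f" % x)
values <- c(1.231, 2.5, 10)
outfile <- "outfile.txt"
cat("", file = outfile)  # create/truncate the file
for (v in values) {
  cat(sprintf("Line with number %.3f\n", v), file = outfile, append = TRUE)
}
```

sprintf gives you the same %-style format specifiers as Python, and append = TRUE keeps each iteration from overwriting the file.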
I'd like to be able to write the contents of a help file in R to a file from within R.
The following works from the command-line:
R --slave -e 'library(MASS); help(survey)' > survey.txt
This command writes the help file for the survey data file
--slave hides both the initial prompt and commands entered from the
resulting output
-e '...' sends the command to R
> survey.txt writes the output of R to the file survey.txt
However, this does not seem to work:
library(MASS)
sink("survey.txt")
help(survey)
sink()
How can I save the contents of a help file to a file from within R?
Looks like the two functions you would need are tools:::Rd2txt and utils:::.getHelpFile. This prints the help file to the console, but you may need to fiddle with the arguments to get it to write to a file in the way you want.
For example:
hs <- help(survey)
tools:::Rd2txt(utils:::.getHelpFile(as.character(hs)))
Since these functions aren't currently exported, I would not recommend you rely on them for any production code. It would be better to use them as a guide to create your own stable implementation.
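For instance, Rd2txt accepts an out argument that sends the rendered text straight to a file; since these are unexported internals, treat this as a sketch rather than a stable recipe:

```r
library(MASS)

# Locate the Rd source for the help page and render it to a text file.
hs <- help(survey)
rd <- utils:::.getHelpFile(as.character(hs))
tools:::Rd2txt(rd, out = "survey.txt")
```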
While Joshua's instructions work perfectly, I stumbled upon another strategy for saving an R help file, so I thought I'd share it. It works on my computer (Ubuntu), where less is the R pager. It essentially just involves saving the file from within less.
help(survey)
Then follow these instructions to save less buffer to file
i.e., type g|$tee survey.txt
g goes to the top of the less buffer if you aren't already there
| pipes text between the range starting at current mark
and ending at $ which indicates the end of the buffer
to the shell command tee which allows standard out to be sent to a file