rsync limit number of file versions - rsync

Simple question here: is there a way to limit the number of backup versions of files when using rsync? Right now I'm using the --backup and --backup-dir options, but they keep an infinite number of versions of each file, which tends to take a lot of space. Is there a way to cap that at, say, 3 versions of a file?
Thanks a lot in advance!
I already tried searching but couldn't find anything helpful.

Related

readxl memory leak in R on Linux RStudio Server

I was working on something relatively simple: I have three Excel files of ~150MB each, with about 240k rows and 145 columns apiece, and I wanted to join them. The thing is, when I open the first file with readxl::read_excel, it suddenly requires 10GB of memory just to read it, making it impossible for me to open the three files (I was barely able to open the first one after several tries and reinstalling readxl), even though once the file is read, the data frame object weighs only 287MB according to object_size().
I'm a bit baffled as to why R needs so much RAM to open my file. Any ideas on what could be happening? Something I might be missing? Any less memory-intensive alternatives?
As extra information, when I opened the file I saw that it has filters enabled and some table formatting from Excel.
Thank you very much

Using write.csv() function in R but it doesn't actually save it to my C:

I'm a newbie and have searched Stack, the internet, everywhere I can think of, but I cannot figure out why write.csv() in R doesn't actually save my data as a csv file on my computer. All I want is to get a .csv file of my work from RStudio to Tableau, and I've spent a week trying to figure it out. Many of the answers I have read use too much coding "lingo" and I cannot translate it because I'm just a beginner. I would be so thankful for any help.
Here is the code I'm using:
""write.csv(daily_steps2,"C:\daily_steps2.csv", row.names = TRUE)""
I put the double quotes around the code because it seems like that's what I'm supposed to do here? IDK, but I don't have those when I run the function. There is no error when I run this, it just doesn't show up as a .csv on my computer. It runs but actually does nothing. Thank you so much for any help.
In my opinion, the simplest way would be to save the file to the same folder that RStudio is running in and use the RStudio GUI. It should be write.csv(daily_steps2, "./daily_steps2.csv") (no quotes around the function), and then in the Files tab in the bottom-right pane of RStudio it should be there. You can then use the graphical user interface to move it to your desktop, analogous to what you would do in MS Word.
The quick fix is to use double backslashes or forward slashes in Windows paths. (Also, since row.names=TRUE is the default, there is no need to specify it.)
write.csv(daily_steps2, "C:\\daily_steps2.csv")
write.csv(daily_steps2, "C:/daily_steps2.csv")
However, consider the OS-agnostic file.path(), which avoids the issue of folder separators in file paths altogether: forward slashes (used on Unix-like systems such as macOS and Linux) versus backslashes (used on Windows).
write.csv(daily_steps2, file.path("C:", "daily_steps2.csv"))
Another benefit of this functional form of path construction is that you can pass dynamic file or folder names without paste().
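For example, a small sketch (the folder and file names here are purely illustrative):
out_dir   <- "C:/exports"        # hypothetical folder, could come from a variable
file_name <- "daily_steps2.csv"  # could equally be built per iteration of a loop
write.csv(daily_steps2, file.path(out_dir, file_name))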

Are locks necessary for writing with fwrite from parallel processes in R?

I have an intensive simulation task that is run in parallel on a high-performance cluster.
Each of the ~3000 threads uses an R script to write its simulation output with the fwrite function of the data.table package.
Our IT guy told me to use locks, so I use the flock package to lock the file while the threads are writing to it.
But this created a new bottleneck: most of the time the processes wait until they can write. Now I am wondering how I can evaluate whether the lock is really necessary. It seems very odd to me that more than 90% of the processing time of all jobs is spent waiting for the lock.
Can anyone tell me whether it is really necessary to use locks when I only append results to a csv with the fwrite function and the argument append = T?
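For reference, the setup described above presumably looks roughly like this (the object and file names are placeholders):
library(data.table)
library(flock)

lck <- flock::lock("results.csv")                 # blocks until the lock is acquired
fwrite(sim_result, "results.csv", append = TRUE)  # append this thread's rows to the shared csv
flock::unlock(lck)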
Edit:
I already tried writing individual files and merging them in various ways after all jobs were completed, but merging also took too long to be acceptable.
It still seems best to just write all simulation results to one file without a lock. This works very fast, and I did not find errors when doing it without the lock for a smaller number of simulations.
Could writing without a lock cause problems that would go unnoticed after running millions of simulations?
(I started writing a few comments to this effect, then decided to wrap them up in an answer. This isn't a perfect step-by-step solution, but your situation is not so simple, and quick-fixes are likely to have unintended side-effects in the long-term.)
I completely agree that relying on file-locking is not a good path. Even if the shared filesystem[1] supports them "fully" (many claim it but with caveats and/or corner-cases), they almost always have some form of performance penalty. Since the only time you need the data all together is at data harvesting (not mid-processing), the simplest approach in my mind is to write to individual files.
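As a rough sketch of that per-process writing side (the directory layout, variable names, and the SLURM environment variable are assumptions on my part), each worker might do something like:
library(data.table)

# assumes a SLURM array job; fall back to the process ID if the variable is not set
task_id <- Sys.getenv("SLURM_ARRAY_TASK_ID", unset = as.character(Sys.getpid()))
out_dir <- file.path("mypath", paste0("task_", task_id))   # one unique sub-directory per worker
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
fwrite(sim_result, file.path(out_dir, "out.csv"))          # no lock needed: nobody else touches this file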
When the whole processing is complete, either (a) combine all files into one (simple bash scripts) and bulk-insert into a database; (b) combine into several big files (again, bash scripts) that are small enough to be read into R; or (c) file-by-file insert into the database.
Combine all files into one large file. Using bash, this might be as simple as
find mypath -name out.csv -print0 | xargs -0 cat > onebigfile.csv
where mypath is the directory under which all of your files are contained, and each process creates its own out.csv file within a unique sub-directory. This is not a perfect assumption, but the premise is that if each process creates a file, you should be able to uniquely identify those output files from all other files/directories under the path. From there, the find ... -print0 | xargs -0 cat > onebigfile.csv is, I believe, the best way to combine them all.
From here, I think you have three options:
Insert into a server-based database (postgresql, sql server, mariadb, etc) using the best bulk-insert tool available for that DBMS. This is a whole new discussion (outside the scope of this Q/A), but it can be done "formally" (with a working company database) or "less-formally" using a docker-based database for your project use. Again, docker-based databases can be an interesting and lengthy discussion.
Insert into a file-based database (SQLite, DuckDB). Both of those options claim to support file sizes well beyond what you would require for this data, and they both give you the option of querying subsets of the data as needed from R. If you don't know the DBI package or the DBI way of doing things, I strongly suggest starting at https://dbi.r-dbi.org/ and https://db.rstudio.com/.
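As a hedged sketch of the DuckDB route (the database, table, and column names here are just examples, and this assumes the duckdb and DBI packages are installed):
library(DBI)

con <- dbConnect(duckdb::duckdb(), dbdir = "simulations.duckdb")   # hypothetical database file
dbExecute(con, "CREATE TABLE results AS SELECT * FROM read_csv_auto('onebigfile.csv')")
subset_df <- dbGetQuery(con, "SELECT * FROM results WHERE scenario = 1")  # 'scenario' is a made-up column
dbDisconnect(con, shutdown = TRUE)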
Split the file and then read it piece-wise into R. I don't know if you can fit the entire data into R, but if you can, and the act of reading it in is the hurdle, then
split --lines=1000000 onebigfile.csv smallerfiles.csv.
HDR=$(head -n 1 onebigfile.csv)
sed -i -e "1i ${HDR}" smallerfiles.csv.*
sed -i -e "1d" smallerfiles.csv.aa
where 1000000 is the number of rows you want in each smaller file. You will find n files named smallerfiles.csv.aa, *.ab, *.ac, etc. (depending on the size, perhaps you'll see three or more letters).
The HDR= and first sed prepend the header row to all smaller files; since the first smaller file already has it, the second sed removes the duplicate first row.
Read each file individually into R or into the database. To bring into R, this would be done with something like:
library(data.table)
# find every per-process out.csv under mypath, read each one, and stack them into a single table
files <- list.files("mypath", pattern = "^out\\.csv$", recursive = TRUE, full.names = TRUE)
alldata <- rbindlist(lapply(files, fread))
assuming that R can hold all of the data at one time. If R cannot (either doing it this way or just reading onebigfile.csv above), then you really have no option other than some form of database[2].
To read them individually into the DBMS, you could likely do it in bash (well, any shell, just not R) and it would be faster than R. For that matter, though, you might as well combine into onebigfile.csv and do the command-line insert once. One advantage, however, of inserting individual files into the database is that, given a reasonably-simple bash script, you could read the data in from completed threads while other threads are still working; this provides mid-processing status cues and, if the run-time is quite long, might give you the ability to do some work before the processing is complete.
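If you would rather stay in R for that incremental loading, a minimal sketch (assuming the results table already exists and reusing the hypothetical DuckDB connection con from above) could be:
files <- list.files("mypath", pattern = "^out\\.csv$", recursive = TRUE, full.names = TRUE)
for (f in files) {
  dbWriteTable(con, "results", data.table::fread(f), append = TRUE)  # append one completed worker's output
}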
Notes:
"Shared filesystem": I'm assuming that these are not operating on a local-only filesystem. While certainly not impossible, most enterprise high-performance systems I've dealt with are based on some form of shared filesystem, whether it be NFS or GPFS or similar.
"Form of database": technically, there are on-disk file formats that support partial reads in R. While vroom:: can allegedly do memory-mapped partial reads, I suspect you might run into problems later as it may eventually try to read more than memory will support. Perhaps disk.frame could work, I have no idea. Other formats such as parquet or similar might be usable, I'm not entirely sure (nor do I have experience with them to say more than this).

Finding location of the current file

My question is essentially the same as this question. I tried all the solutions listed there, but they didn't work :(
The only difference is that I am not sourcing other R files; I am going to read csv files that are in the same location as the current R script.
I need this so that I can transfer the R file easily to other PCs/systems.
I want the solution to work in RStudio and on the command line, on both Windows and Linux.
I would like to offer a bounty of 50 credits
How about adding this to your script?
currentpath <- getwd()
Then you can read the csv file foo.csv with
read.csv(paste0(currentpath,'/','foo.csv'))
To make the code more platform independent you can explore the normalizePath function.
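For instance, a small sketch of what that might look like (foo.csv is just the example file from above):
currentpath <- getwd()
# normalizePath() returns a canonical absolute path using the platform's separator
csv_file <- normalizePath(file.path(currentpath, "foo.csv"))
dat <- read.csv(csv_file)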

inet PDFc unable to recognize characters

I was using PDFc to compare two files using ConsoleResultHandle.
Both my files were similar; I had copy-pasted them.
After comparing, this tool PDFc was giving
DEBUG - Unsupport CMap format: 6
I checked the differences folder (where it shows the differences), and in the PNG files it generates it is showing all boxes (unsupported characters), as the above debug message says.
Did anyone else encounter the same problem?
This problem can occur if there are issues parsing the PDF file. The best way to proceed, as gamma mentions in the comment, is to contact our support team at pdfc#inetsoftware.de with the PDFs in question - we usually answer within 24 hours and will see if we can fix the problem.
