To simplify manually copying large numbers of files, I often use FreeFileSync. I noticed that it preserves the original file metadata, such as when a file was created, last modified etc.
Now I need to regularly copy tons of files in batch mode and I'd like to do it in R. So I wondered if R is capable of preserving that information as well. AFAIU, file.rename() and file.copy() alter the file information, e.g. the times are set to the time the files were actually copied.
Is there any way I can restore the original file information after the files have been copied?
Robocopy via system2() can keep the timestamps.
cmdArgs <- paste(normalizePath(file.path(getwd()), winslash = "/"),
                 normalizePath(file.path(getwd(), "bkup"), winslash = "/"),
                 "*.txt",
                 "/copy:DAT /V")
system2("robocopy.exe", args = cmdArgs)
Robocopy has a slew of switches for all different types of use cases and can accept a 'job' file for the parameters and file names. R's ability to call out via system() could also be used to run an elevated session (perhaps easiest via a PowerShell script that calls Robocopy) so that all of the auditing info (permissions and such) is retained as well.
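If you want to stay in R, here is a sketch (not part of the Robocopy answer): base R's file.copy() has a copy.date argument, and Sys.setFileTime() can restore modification times captured with file.info() before the copy. The bkup folder and *.txt pattern below just mirror the example above.
# copy *.txt into bkup/ while trying to keep the original modification times
dir.create("bkup", showWarnings = FALSE)
src  <- list.files(pattern = "\\.txt$", full.names = TRUE)
dest <- file.path("bkup", basename(src))

orig_mtime <- file.info(src)$mtime                   # capture times before copying
file.copy(src, dest, copy.date = TRUE)               # ask file.copy() to preserve file dates

# and/or restore the captured times explicitly afterwards
invisible(mapply(Sys.setFileTime, dest, orig_mtime))
Note that this covers modification times only, not the Windows creation time, so Robocopy's /copy:DAT remains the more complete option.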
I have already finished my RMarkdown document and I'm trying to tidy up the workspace a little. This isn't strictly necessary, more of an organizational habit (I'm not even sure it's good practice), so that I can keep the data separate from the scripts and other R- and git-related files.
I have a bunch of .csv files for the data that I used. Previously they were in (for example)
C:/Users/Documents/Project
which is what I set as my working directory. But now I want them in
C:/Users/Document/Project/Data
The problem is that this breaks the following code, because the files are no longer in the working directory.
#create one big dataframe by unioning all the data
bigfile <- vroom(list.files(pattern = "*.csv"))
I've tried pointing list.files() at the full path where the csvs are, but no luck.
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv"))
Error: 'data1.csv' does not exist in current working directory ('C:/Users/Documents/Project').
Is there a way to only access the /Data folder once for creating my dataframe with vroom() instead of changing the working directory multiple times?
You can list files, including those in all subdirectories (Data in particular), using list.files(pattern = "*.csv", recursive = TRUE).
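A minimal sketch of that combined with vroom(), assuming the csvs live in Data/ under the working directory (note the anchored regex instead of the glob-style pattern):
library(vroom)

# recursive = TRUE descends into Data/ (and any other subdirectory);
# full.names = TRUE returns paths that work regardless of the working directory
files <- list.files(pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
bigfile <- vroom(files)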
Best practices
Have one directory of raw and only raw data (the stuff you measured)
Have another directory of external data (e.g. reference databases). This is something you can remove afterwards and redownload if required.
Have another directory for the source code
Put only the source code directory under version control, plus one other file containing checksums of the raw and external data to prove integrity (see the sketch after this list)
Everything else must be reproducible from the raw data and the source code and can be removed after the project. You may want to keep small result files (e.g. tables) that take a long time to reproduce.
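A minimal sketch of the checksum idea using base R's tools::md5sum() (the file and directory names are illustrative):
# record checksums of the raw data and keep this file under version control
raw_files <- list.files("raw_data", full.names = TRUE, recursive = TRUE)
sums <- tools::md5sum(raw_files)
write.table(data.frame(file = names(sums), md5 = unname(sums)),
            "checksums_raw_data.txt", row.names = FALSE)

# later: recompute and compare to prove the raw data are unchanged
stored <- read.table("checksums_raw_data.txt", header = TRUE)
stopifnot(identical(unname(tools::md5sum(stored$file)), stored$md5))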
You can list the files and capture the full filepath name right?
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv", full.names = T))
and that should read the files in the directory without reference to your wd
Try one of these:
# list all csv files within Data within current directory
Sys.glob("Data/*.csv")
# list all csv files within immediate subdirectories of current directory
Sys.glob("*/*.csv")
If you only have csv files then these would also work but seem less desirable. Might be useful though if you quickly want to review what files and directories are there. (I would be very careful not to use the second one within statements to delete files since if you are not in the directory you think it is in then you can wind up deleting files you did not intend to delete. The first one might too but is a bit safer since it would only lead to deleting wrong files if the directory you are in does have a Data subdirectory.)
# list all files & directories within Data within current directory
Sys.glob("Data/*")
# list all files & directories within immediate subdirectories of current directory
Sys.glob("*/*")
If the subfolder always has the same name (or the same number of characters), you should be able to do it thanks to substring. In your example, "Data" has 4 characters (5 with the /), so the following code should do:
Repository <- substring(getwd(), 1, nchar(getwd())-5)
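A more general alternative (not part of the original answer) is dirname(), which drops the last path component whatever the subfolder is called:
# parent directory of the current working directory, regardless of the subfolder's name
Repository <- dirname(getwd())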
I have 48 scripts used to clean data corresponding to 48 different tests. The cleaning protocols for each test used to be unique and test-specific, but the final project guideline now allows all tests to use the same cleaning protocol, provided they save all output files to the appropriate directory (each test's own folder of results). I'm trying to combine these tests into one master cleaning script that any team member can use to clean data as more is collected, or to make small changes, given they have the raw data files and a folder for each test (which I would give to them).
Currently I have tried two approaches:
The first is to include all necessary libraries in the body of a master cleaning script and then source() each individual cleaning script. Inside each script the libraries are then require()d, the appropriate files are read in, and the output files are saved to their correct destinations. This method seems to work best, but if the whole script is run, some subtests are successfully cleaned and saved to their correct locations while the rest need to be saved individually, and I'm not sure why.
library(readr)
library(dplyr)
library(data.table)
library(lubridate)
source("~/SF_Cleaning_Protocol.R")
etc
.
.
The second is to save the body of the general cleaning script as a function, and then call that function in a series of if statements based on the test one wants to clean.
For example:
if (testname == "SF"){
setwd("~/SF")
#read in the csv file
subtest<- read_csv()
path_map<- read_csv()
SpecIDs<- read_csv()
CleaningProtocol(subtest,path_map,SpecIDs)
write.csv("output1.csv")
write.csv("output2.csv")
write.csv("output3.csv")
write.csv("output4.csv")
} else if (testname == "EV"){
etc
}
The code reads in and writes out files fine when a branch is run on its own, but when testname is specified and the script is run as a whole, it ignores the if statements, runs all tests, and fails to write results for any of them.
Is there a better option I haven't tried, or can anyone help me diagnose my issues?
Many thanks.
QFile::rename description says:
If the rename operation fails, Qt will attempt to copy this file's
contents to newName, and then remove this file, keeping only newName.
That is undesirable. I need to call QFile::rename only if the file can be renamed without copying (e.g. it remains on the same disk drive on Windows). Is there a function in Qt that can perform this check (without me having to code it manually for every platform)?
I ended up getting and comparing the drive number on Windows (PathGetDriveNumber) and the device ID on Unix (the stat function and the st_dev field of the stat structure). It seems to work as expected so far.
I use parSapply() from the parallel package in R. I need to perform calculations on a huge amount of data. Even in parallel it takes hours to execute, so I decided to write results to a file from the clusters at regular intervals using write.table(), because the process crashes from time to time (running out of memory or for some other random reason) and I want to continue the calculations from where they stopped. I noticed that some lines of the csv files I get are cut off in the middle, probably because several processes write to the file at the same time. Is there a way to place a lock on the file while write.table() executes so that other clusters can't access it, or is the only way out to write to a separate file from each cluster and then merge the results?
It is now possible to create file locks using the filelock package (available on GitHub).
In order to facilitate this with parSapply() you would need to edit your loop so that if the file is locked the process will not simply quit, but either try again or Sys.sleep() for a short amount of time. However, I am not certain how this will affect your performance.
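For example, a rough sketch of wrapping write.table() in a filelock lock (the function and file names here are illustrative):
library(filelock)

safe_append <- function(dat, file) {
  # block for up to 10 seconds waiting for the lock; lock() returns NULL on timeout
  lck <- lock(paste0(file, ".lock"), timeout = 10000)
  if (is.null(lck)) {
    Sys.sleep(1)                        # could not get the lock; back off and retry
    return(safe_append(dat, file))
  }
  on.exit(unlock(lck))                  # always release the lock, even on error
  write.table(dat, file = file, sep = ",", append = file.exists(file),
              col.names = !file.exists(file), row.names = FALSE)
}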
Instead I recommend you create cluster-specific files that can hold your data, eliminating the need for a lock file and not reducing your performance. Afterwards you should be able to weave these files and create your final results file.
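A rough sketch of the cluster-specific-file idea (the calculation and file names are placeholders): each worker writes to a file keyed by its own process ID, and the pieces are stitched together afterwards.
library(parallel)

cl <- makeCluster(4)

parSapply(cl, 1:100, function(i) {
  out <- data.frame(id = i, value = sqrt(i))             # stand-in for the real calculation
  f <- sprintf("results_worker_%d.csv", Sys.getpid())    # one file per worker process
  write.table(out, file = f, sep = ",", append = file.exists(f),
              col.names = !file.exists(f), row.names = FALSE)
  i
})

stopCluster(cl)

# afterwards, weave the per-worker files into one results file
parts <- list.files(pattern = "^results_worker_.*\\.csv$")
final <- do.call(rbind, lapply(parts, read.csv))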
If size is an issue then you can use disk.frame to work with files that are larger than your system RAM.
The old unix technique looks like this:
# Make sure other processes are not writing to the file by trying to create a
# lock directory: if the directory already exists, mkdir fails and we try again.
# Exit the repeat loop once the lock directory has been created successfully.
repeat {
  if (system2(command = "mkdir", args = "lockdir", stderr = NULL) == 0) break
}
write.table(MyTable, file = filename, append = TRUE)
# get rid of the locking directory
system2(command = "rmdir", args = "lockdir")
What is the correct way to enter data(d=read.table("WHAT GOES HERE IF YOU HAVE A MACBOOK")) if you have a Mac?
Also, what does the error message below mean:
d=read.table(“Firststatex.notepad”,header=T)
Error: unexpected input in "d=read.table(‚"
Two usage errors:
You don't use data() to read datasets held in external files into R. data() is an R function for loading datasets that are built into R and R packages. read.table("foo.txt") will return a data frame object from the file "foo.txt", which you can assign to an object within R using the assignment operator <-, e.g.
DF <- read.table("foo.txt")
As for "what goes here...", you need to supply a file system path from the current directory to the directory holding the file you want to read in. If the file "foo.txt" is in the current working directory, you can just provide the file name with extension as I did above. If the file is in another directory you need to supply the path to the file name and the file name, for example if the file "foo.txt" is located in the directory above the current directory, you would supply "../foo.txt". If it were in a directory myData located in the directory above the current directory you could us "../myData/foo.txt". So paths can be relative to the current directory. You can also use the fully qualified path on your file system hierarchy.
An alternative is to use the file.choose() function in place of the file name string. This will allow you to navigate to the file you wish to load interactively using a native file selection dialogue. This is what happens on Windows and I suspect also on Mac; not much different happens on Linux. For example:
DF <- read.table(file.choose())
You should probably look for specific help for your operating system if you are not familiar with how to specify file names and paths.
I get the same error when copying and pasting the code you provide. The problem comes from the fact that you are using fancy, curly quotes “Firststatex.notepad” rather than the straight quote marks R accepts for character strings, " and ': i.e. "Firststatex.notepad" or 'Firststatex.notepad'. Although the quotes you used look like quotes to you or me, most computer programs do not recognise them as quotes. MS Word often inserts these curly quotes when you type ", as do many other applications.