List.files from web accessible folder - r

I need to be able to fill a vector with the files in a web accessible folder.
The variables used in setting the image_dir variable are set to the following:
msu_path = "http://oer.hpc.msstate.edu/okeanos/"
sub$cruiseID = "EX1504L2"
divespecs$specID = "EX1504L2_20150802T223100_D2_DIVE01_SPEC01GEO/"
image_dir = "http://oer.hpc.msstate.edu/okeanos/ex1504l2/EX1504L2_20150802T223100_D2_DIVE01_SPEC01GEO/"
image_dir <- sprintf("%s%s/%s", msu_path, tolower(sub$cruiseID), divespecs$specID)
file_names <- list.files(path = image_dir, pattern = "jpg", ignore.case=TRUE)
If I do exactly the same thing with a path inside my working directory, it works fine. With the settings above, image_dir does receive a valid URL, but file_names comes back as character(0) (empty) instead of a vector of the .jpg files in that web accessible folder (WAF).
Thanks for any advice.
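For what it's worth, list.files only reads local file system paths, so it returns character(0) for a URL. One possible workaround, sketched below, is to download the folder's index page and pull the .jpg links out of it. This assumes the server returns an HTML directory listing with ordinary href="..." links; extract_jpg_links and the sample HTML are made up for illustration, and the real page may need a different pattern.

```r
# Hypothetical helper: pull href targets ending in .jpg/.jpeg out of an HTML index page.
extract_jpg_links <- function(html_lines) {
  html <- paste(html_lines, collapse = "\n")
  # Find every href="...jpg" attribute, ignoring case (JPG, Jpg, ...)
  m <- gregexpr('href="[^"]+\\.jpe?g"', html, ignore.case = TRUE)
  hrefs <- regmatches(html, m)[[1]]
  # Strip the leading href=" and the trailing quote
  sub('"$', "", sub('^href="', "", hrefs, ignore.case = TRUE))
}

# Offline demonstration on a made-up listing (not the real server response)
sample_html <- c('<a href="IMG_0001.jpg">IMG_0001.jpg</a>',
                 '<a href="notes.txt">notes.txt</a>',
                 '<a href="IMG_0002.JPG">IMG_0002.JPG</a>')
file_names <- extract_jpg_links(sample_html)
print(file_names)

# Against the real folder (not run here; assumes an HTML index is served):
# file_names <- extract_jpg_links(readLines(image_dir, warn = FALSE))
```

If the links in the listing are relative, paste0(image_dir, file_names) would give full URLs for downloading.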

Related

R: How to match a forward-slash in a regular expression?

How do I match on a forward slash / in a regular expression in R?
As demonstrated in the example below, I am trying to search for .csv files in a subdirectory and my attempts to use a literal / are failing. Looking for a modification to my regex in base R, not a function that does this for me.
Example subdirectory
# Create subdirectory in current working directory with two .csv files
# - remember to delete these later or they'll stay in your current working directory!
dir.create(path = "example")
write.csv(data.frame(x1 = letters), file = "example/example1.csv")
write.csv(data.frame(x2 = 1:20), file = "example/example2.csv")
Get relative paths of all .csv files in the example subdirectory
# This works for the example, but could mistakenly return paths to other files based on:
# (a) file name: foo/example1.csv
# (b) subdirectory name: example_wrong/foo.csv
list.files(pattern = "example.*csv", recursive = TRUE)
#> [1] "example/example1.csv" "example/example2.csv"
# This fixes issue (a) but doesn't fix issue (b)
list.files(pattern = "^example.*?\\.csv$", recursive = TRUE)
#> [1] "example/example1.csv" "example/example2.csv"
# Adding / to the end of `example` guarantees we get the correct subdirectory
# Doesn't work: / is special regex and not escaped
list.files(pattern = "^example/.*?\\.csv$", recursive = TRUE)
# Doesn't work: escapes / but throws error
list.files(pattern = "^example\/.*?\\.csv$", recursive = TRUE)
# Doesn't work: even with the \\ escaping in R!
list.files(pattern = "^example\\/.*?\\.csv$", recursive = TRUE)
Some of the solutions above work with regex tools but not in R. I've checked SO for solutions (most related below) but none seem to apply:
Escaping a forward slash in a regular expression
Regex string does not start or end (or both) with forward slash
Reading multiple csv files from a folder with R using regex
The pattern argument is only used for matching file (or directory) names, not the full path they are on (even when recursive and full.names are set to TRUE). That's why your last approach doesn't work even though it is the correct way to match / in a regular expression. You can get the correct file names by specifying path and setting full.names to TRUE.
list.files(path = "example", pattern = "\\.csv$", full.names = TRUE)
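A small sketch of the point made in the answer, run in a throwaway folder under tempdir() (the folder names slash_demo and example_wrong are only for illustration). It shows that a pattern containing / finds nothing when passed to list.files, while the same regex applied with grepl to the returned relative paths does match on the subdirectory.

```r
# Build a scratch tree: example/example1.csv and example_wrong/foo.csv
root <- file.path(tempdir(), "slash_demo")
dir.create(file.path(root, "example"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "example_wrong"), showWarnings = FALSE)
write.csv(data.frame(x1 = letters), file.path(root, "example", "example1.csv"))
write.csv(data.frame(x2 = 1:20), file.path(root, "example_wrong", "foo.csv"))

# pattern is matched against file names only, so a "/" can never match
by_pattern <- list.files(root, pattern = "^example/.*\\.csv$", recursive = TRUE)

# Filtering the returned relative paths afterwards does see the "/"
all_paths <- list.files(root, recursive = TRUE)
by_grepl <- all_paths[grepl("^example/.*\\.csv$", all_paths)]

print(by_pattern)
print(by_grepl)
```

The grepl route is only needed when the listing has to start above the subdirectory; otherwise passing path = "example" as in the answer is simpler.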

Renaming files with the same name

I'm having trouble renaming files. I want to keep files that end up with the same name by adding an identifier ("_1", "_2", "_3", ..., "_x"), for example.
I'm using the package "R.utils" and the renameFile function.
library(R.utils)  # renameFile()
library(stringr)  # str_replace()
directory <- "C:\\Users\\javie\\Documents\\Programacion\\Script\\Prueba_Rename"
pattern <- "_TED version=_1.0___DD__RE_XXXXXXXX-2__RE__TD_52__TD__F"
setwd(directory)
renameFile(list.files(pattern = pattern),
           str_replace(list.files(pattern = pattern),
                       pattern = pattern, ""),
           overwrite = FALSE)
If the file already exists, it returns
"Error: File already exists: Guia_De_Despacho__77166.pdf"
which is good, because I don't want to overwrite anything. But how can I change the name of the file and add an identifier if it already exists?
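One way this could be approached, as a minimal base R sketch rather than anything R.utils provides: check whether the target name is taken and, if so, try _1, _2, ... before the extension until a free name turns up. next_free_name is a hypothetical helper, and the demo runs in a scratch folder under tempdir() instead of the real directory.

```r
# Hypothetical helper: return `path` unchanged if it is free, otherwise
# insert _1, _2, ... before the extension until an unused name is found.
next_free_name <- function(path) {
  if (!file.exists(path)) return(path)
  ext <- tools::file_ext(path)
  stem <- tools::file_path_sans_ext(path)
  i <- 1
  repeat {
    candidate <- if (nzchar(ext)) {
      sprintf("%s_%d.%s", stem, i, ext)
    } else {
      sprintf("%s_%d", stem, i)
    }
    if (!file.exists(candidate)) return(candidate)
    i <- i + 1
  }
}

# Demonstration in a scratch folder: the plain name is already taken
demo_dir <- file.path(tempdir(), "rename_demo")
dir.create(demo_dir, showWarnings = FALSE)
taken <- file.path(demo_dir, "Guia_De_Despacho__77166.pdf")
file.create(taken)

target <- next_free_name(taken)
print(basename(target))
```

The result could then be used as the destination in file.rename(from, target), one file at a time, so each rename sees the names created by the previous ones.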

A copying nightmare, choosing files to copy based on files in another folder

I have a bit of an issue with using file.copy.
I need to copy .tif files from a directory with several subdirectories (where the .tif files are) based on the names of those in another directory. I have the following code (which is almost working):
ValidatedDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Validated"
RawDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Raw"
OutputDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Ouputfolder"
ValidatedImages <- list.files(ValidatedDirectory)
# this is to remove the extra bit that is added onto the validated images [working]
pattern <- gsub("_hc", "", ValidatedImages)
pattern <- paste(gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", pattern), collapse="|")
# this bit tackles finding the relevant files based on the ValidatedImages
filesinRAW <- list.files(
path = RawDirectory,
recursive = TRUE,
include.dirs = FALSE,
full.names = FALSE)
filesinRAW <- as.list(filesinRAW)
# this removes subdirectory prefix in front of the file and .tif which confuses it
filesinRAW <- as.list(gsub("\\d\\d\\d\\d/", "", filesinRAW))
filesinRaw <- as.list(gsub(".tif", "", filesinRAW))
tocopy <- grep(filesinRAW, pattern = pattern, value = TRUE)
tocopy <- as.list(tocopy)
tocopy <- as.list(gsub(".tif", "", tocopy))
setwd(RawDirectory)
file.copy(from = tocopy, to = OutputDirectory, overwrite = TRUE)
I get a "No such file or directory" error. The files do exist (obviously), so I must be doing something wrong with the naming.
I have been having a bash at it for a good while. If helpful, I can upload the example data and share the link.
Thanks for any help community!
When debugging, try to break down your code to see whether your variables are still what you expect them to be at each step.
That said, I see several problems in your code right now:
grep works with pattern being a length-one regular expression. If you give it multiple regular expressions, it uses the first one (with a warning, which you don't see if you've disabled them).
To use multiple matches, you can use apply and sapply: filesinRAW[apply(sapply(pattern, grepl, x=filesinRAW), 1, any)]. But see the last point.
grep by default uses pattern as a regular expression, which may break things if your pattern contains characters with a special meaning in regular expressions. For example, grep('^test', '^test') gives zero results. To check if a string contains a literal string, you can use grep(..., fixed=TRUE).
In the last step, you use gsub(".tif", "", tocopy), which removes anything that looks like .tif (the unescaped . matches any character). I suppose you meant to add .tif again at the end; right now you are trying to copy files without an extension, which won't be found. To add it back, you can use paste.
In several steps you use as.list. Why? In R, most functions are vectorised, so a character vector already holds multiple values. The difference between a list and a vector is that lists can store different kinds of objects, but you're not doing that anyway. As far as I can see, the as.list calls don't harm anything, because all the functions will, as a first step, convert your list back to a character vector.
Finally, as far as I can see you're first making a list of filenames that need to be copied (pattern), which you then compare to a full list of your files, and you try to make them match exactly. Then why use a regular expression? Regular expressions are useful if you only know part of what your filenames look like, but is that your goal? E.g. if filename1._hc is in your ValidatedDirectory, do the files filename11.tif and filename12.tif need to be copied as well?
If you're just looking for exact matches, you can directly compare them:
tocopy <- tocopy[tocopy %in% pattern]
But generally, working in R is easy because you can do everything step-by-step, and if you just inspect tocopy, you can see whether your call makes sense.
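The exact-match approach could look roughly like the sketch below. It runs on a made-up tree under tempdir() (copy_demo, a.tif, b.tif and c.tif are invented), and it assumes the validated names differ from the raw names only by the _hc suffix. Keeping full.names = TRUE and comparing on basename() means the paths passed to file.copy still point at real files.

```r
# Scratch layout mirroring the question: Raw/<subdir>/*.tif, Validated/*_hc.tif
root <- file.path(tempdir(), "copy_demo")
raw_dir <- file.path(root, "Raw")
validated_dir <- file.path(root, "Validated")
output_dir <- file.path(root, "Output")
for (d in c(file.path(raw_dir, "0001"), file.path(raw_dir, "0002"),
            validated_dir, output_dir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
file.create(file.path(raw_dir, "0001", "a.tif"),
            file.path(raw_dir, "0002", "b.tif"),
            file.path(raw_dir, "0002", "c.tif"))
file.create(file.path(validated_dir, "a_hc.tif"),
            file.path(validated_dir, "c_hc.tif"))

# Names we want, with the literal "_hc" removed (fixed = TRUE: no regex involved)
wanted <- sub("_hc", "", list.files(validated_dir), fixed = TRUE)

# Keep full paths, but compare on the file name alone
raw_files <- list.files(raw_dir, recursive = TRUE, full.names = TRUE)
to_copy <- raw_files[basename(raw_files) %in% wanted]

copied <- file.copy(from = to_copy, to = output_dir, overwrite = TRUE)
print(list.files(output_dir))
```

Because the paths are absolute, no setwd call is needed before copying.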
After much help from @Emil Bode I have the following solution to the issue (perhaps not the most elegant, but it runs quickly enough on 1000s of .tif files).
ValidatedDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Validated"
RawDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Raw"
OutputDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Ouputfolder"
ValidatedImages <- list.files(ValidatedDirectory)
pattern <- gsub("_hc", "", ValidatedImages)
pattern <- paste(gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", pattern), collapse="|")
filesinRAW <- list.files(
path = RawDirectory,
recursive = TRUE,
include.dirs = FALSE,
full.names = FALSE,
pattern = pattern)
setwd(RawDirectory)
file.copy(from = filesinRAW, to = OutputDirectory, overwrite = TRUE)

R, obtain complete file path string in files names in Windows (spaces and more)

Certainly an old issue, but I was not able to find a solution (maybe there is none). On Unix it is straightforward to use the R function file.path to obtain the path to some file. How can the same thing be done under Windows, where path elements containing spaces come back as short names with a ~?
If I need to write, say the path to Rscript.exe to a file, this would work on unix:
x <- list.files(R.home("bin"), full.names = TRUE, pattern = "Rscript")
writeLines(x, con = "path_to_rscript.txt")
On Windows the result is:
C:/PROGRA~1/R/R-35~1.1/bin/x64/Rscript.exe
Where I would have wanted something like:
C:/Program Files/R-3.5.1/bin/x64/Rscript.exe
Is there a way to circumvent this behavior (and what is it with the capitalized PROGRA)?
Indeed, check out normalizePath:
normalizePath(path, winslash = "\\", mustWork = NA)
which states explicitly:
On Windows it converts relative paths to absolute paths, converts
short names for path elements to long names and ensures the separator
is that specified by winslash. It will match paths case-insensitively
and return the canonical case. UTF-8-encoded paths not valid in the
current locale can be used.
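Applied to the example from the question, that might look like the sketch below. The output file is written to tempdir() here purely for the demonstration, and winslash = "/" is chosen to keep forward slashes as in the desired output; on non-Windows systems the call just returns the canonical absolute path.

```r
# Locate Rscript as in the question
x <- list.files(R.home("bin"), full.names = TRUE, pattern = "Rscript")

# Expand short (8.3) path elements to long names on Windows;
# mustWork = FALSE avoids an error if a path cannot be resolved
long_path <- normalizePath(x, winslash = "/", mustWork = FALSE)

# Write the result to a scratch file and read it back
out_file <- file.path(tempdir(), "path_to_rscript.txt")
writeLines(long_path, con = out_file)
print(readLines(out_file))
```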

glob2rx, placing a wildcard in the middle of expression and specifying exceptions, r

I am writing an R script that performs a function for all files in a series of subdirectories. I have run into a problem where several files in these subdirectories are being recognized by my glob2rx function, and I need help refining my pattern so I can select the file I want.
Here is an example of my directory structure:
subdir1
    file1_aaa_111_subdir1.txt
    file1_bbb_111_subdir1.txt
    file1_aaa_subdir1.txt
subdir2
    file1_aaa_111_subdir2.txt
    file1_bbb_111_subdir2.txt
    file1_aaa_subdir2.txt
I want to select the last file in each directory, although in my actual directory its position varies. I want to use something like:
inFilePaths = list.files(path=".", pattern=glob2rx("*aaa*.txt"), full.names=TRUE)
but I don't get any files. Looking at this pattern, I would in theory get both the first and last file in each directory, meaning I need to write an exception to exclude the aaa_111 files and keep the aaa_subdir files.
There is a second option I have been thinking about, but lack the ability to realize. Notice the name of the subdirectory is at the end of each file name. Is it possible to extract the directory name, and then combine it with a glob2rx pattern, and then directly specify which file I want? Like this:
# list all the subdirectories
subDirsPaths = list.dirs(path=".", full.names=TRUE)
# perform a function on these directories one by one
for (subDirsPath in subDirsPaths){
  # make the subdirectory the working directory
  setwd("/home/phil/Desktop/working")
  setwd(paste(subDirsPath, sep=""))
  # get the working directory name, and trim the "./" from it
  directory <- gsub("./", "", paste(subDirsPath, sep=""))
  # attempt to get the desired file by pasting the directory name into the glob2rx function
  inFilePaths = list.files(path=".", pattern=glob2rx("*aaa_", print(directory,".txt")), full.names=TRUE)
  for (inFilePath in inFilePaths)
  {
    inFileData <- read_tsv(inFilePath, col_names=TRUE)
  }
}
With some modification the second option worked well. I ended up using paste in combination with print as follows:
inFilePaths = list.files(path=".", pattern=glob2rx(print(paste("*", "aaa_", directory, ".txt", sep=""))), full.names=TRUE)
The paste function combined the text into a single string, which also preserved the wildcard. The print call echoes that string and returns it unchanged, so it reaches glob2rx as the pattern; paste alone would work too.
While this doesn't allow me to place a wildcard in the middle of an expression, which I believe is done using an escape character, and it doesn't address the need to place exceptions on the wildcard, it works for my purposes.
I hope this helps others in my position.
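For the two points left open, here is a rough sketch on a scratch copy of the directory layout (created under tempdir(); glob_demo is an invented name). glob2rx does accept a wildcard in the middle of the glob, and an exception can be handled by filtering the matches afterwards with grepl. Note the recursive = TRUE: without it list.files does not descend into subdirectories, which may be why the original call returned nothing.

```r
# Recreate the example layout in a scratch folder
root <- file.path(tempdir(), "glob_demo")
for (d in c("subdir1", "subdir2")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, d,
                        sprintf(c("file1_aaa_111_%s.txt",
                                  "file1_bbb_111_%s.txt",
                                  "file1_aaa_%s.txt"), d)))
}

# Wildcard in the middle: matches every name containing "aaa" and ending in .txt
candidates <- list.files(root, pattern = utils::glob2rx("*aaa*.txt"),
                         recursive = TRUE, full.names = TRUE)

# Exception: drop the aaa_111 files (fixed = TRUE treats the text literally)
wanted <- candidates[!grepl("aaa_111", basename(candidates), fixed = TRUE)]
print(basename(wanted))

# The per-directory variant from the answer, without print()
one_dir <- list.files(file.path(root, "subdir1"),
                      pattern = utils::glob2rx(paste0("*aaa_", "subdir1", ".txt")),
                      full.names = TRUE)
print(basename(one_dir))
```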

Resources