R: How to match a forward slash in a regular expression?

How do I match on a forward slash / in a regular expression in R?
As demonstrated in the example below, I am trying to search for .csv files in a subdirectory and my attempts to use a literal / are failing. Looking for a modification to my regex in base R, not a function that does this for me.
Example subdirectory
# Create subdirectory in current working directory with two .csv files
# - remember to delete these later or they'll stay in your current working directory!
dir.create(path = "example")
write.csv(data.frame(x1 = letters), file = "example/example1.csv")
write.csv(data.frame(x2 = 1:20), file = "example/example2.csv")
Get relative paths of all .csv files in the example subdirectory
# This works for the example, but could mistakenly return paths to other files based on:
# (a) file name: foo/example1.csv
# (b) subdirectory name: example_wrong/foo.csv
list.files(pattern = "example.*csv", recursive = TRUE)
#> [1] "example/example1.csv" "example/example2.csv"
# This fixes issue (a) but doesn't fix issue (b)
list.files(pattern = "^example.*?\\.csv$", recursive = TRUE)
#> [1] "example/example1.csv" "example/example2.csv"
# Adding / to the end of `example` guarantees we get the correct subdirectory
# Doesn't work: / is special regex and not escaped
list.files(pattern = "^example/.*?\\.csv$", recursive = TRUE)
# Doesn't work: escapes / but throws error
list.files(pattern = "^example\/.*?\\.csv$", recursive = TRUE)
# Doesn't work: even with the \\ escaping in R!
list.files(pattern = "^example\\/.*?\\.csv$", recursive = TRUE)
Some of the solutions above work with regex tools but not in R. I've checked SO for solutions (most related below) but none seem to apply:
Escaping a forward slash in a regular expression
Regex string does not start or end (or both) with forward slash
Reading multiple csv files from a folder with R using regex

The pattern argument is only used for matching file (or directory) names, not the full path they are on (even when recursive and full.names are set to TRUE). That's why your last approach doesn't work even though it is the correct way to match / in a regular expression. You can get the correct file names by specifying path and setting full.names to TRUE.
list.files(path = "example", pattern = "\\.csv$", full.names = TRUE)
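If the relative paths are still wanted (as in the question), one option is to list everything and filter the returned paths afterwards. The following is a minimal sketch, assuming a throwaway directory under tempdir(); the names root, via_pattern and via_grepl are made up for illustration.

```r
# Build a small scratch tree (hypothetical location under tempdir()).
root <- file.path(tempdir(), "slash_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "example"), recursive = TRUE)
dir.create(file.path(root, "example_wrong"))
write.csv(data.frame(x = 1), file.path(root, "example", "example1.csv"))
write.csv(data.frame(x = 1), file.path(root, "example_wrong", "foo.csv"))

# `pattern` is matched against file names only, so a pattern containing "/"
# can never match anything here.
via_pattern <- list.files(root, pattern = "^example/.*\\.csv$", recursive = TRUE)

# Filtering the returned relative paths with grepl() does see the "/".
all_paths <- list.files(root, recursive = TRUE)
via_grepl <- all_paths[grepl("^example/.*\\.csv$", all_paths)]

print(via_pattern)
print(via_grepl)
```

Note that the forward slash needs no escaping in the regular expression; the only reason the attempts in the question returned nothing is that the slash never appears in the strings being matched.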

Related

List files that end with pattern and lack an extension

I have a directory with multiple subdirectories that contain files.
The files themselves have no extension; however, each file has an additional header file with the extension ".hdr".
In R, I want to list all file names that contain the string map_masked and end with the pattern "masked", but I only want the files without an extension (the ones that end with the pattern, not the header files).
As suggested in this answer, I tried to use the $ sign to indicate the pattern should occur at the end of a line.
This is the code I used:
dir <- "/my/directory"
list.files(dir, pattern = "map_masked|masked$", recursive = TRUE)
The output, however, looks as follows:
[1] "subdirectory/something_map_masked_something_masked"
[2] "subdirectory/something_map_masked_something_masked.hdr"
etc.
Now, how do I tell R to exclude the files that have an ".hdr" extension?
I am aware this could easily be done by applying a filter on the output, but I would rather like to know what is wrong with my code and understand why R behaves the way it does in this case.
You can use
list.files(dir, pattern = "map_masked.*masked$", recursive = TRUE)
It returns filepaths that contain map_masked and end with masked string.
Details:
map_masked - a fixed string
.* - any zero or more chars as many as possible
masked - a masked substring
$ - end of string.
See the regex demo.
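As for why the original pattern also returned the header files: the | splits it into two alternatives, map_masked anywhere or masked at the end, and the .hdr names satisfy the first one. A small sketch with grepl() on two file names taken from the question's output (the variable names are only for illustration):

```r
files <- c("something_map_masked_something_masked",
           "something_map_masked_something_masked.hdr")

# Alternation: "contains map_masked" OR "ends with masked".
# Both names contain map_masked, so both match.
alternation <- grepl("map_masked|masked$", files)

# Sequence: "contains map_masked, and later ends with masked".
# The .hdr name ends with "hdr", so it is rejected.
sequence <- grepl("map_masked.*masked$", files)

print(alternation)
print(sequence)
```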

Use R fs::dir_ls to match the beginning of file name?

I'm trying to use fs::dir_ls() to return the same results as the list.files() example below. Ultimately, I'm just trying to return files that start with a specific pattern.
path <- "./path/to/files"
pattern <- "^ABC_.*\\.csv$"
# list files returns the expected output
list.files(path = path, pattern = pattern, full.names = T)
# [1] "path/to/files/ABC_1312.csv"
# [2] "path/to/files/ABC_ACAB.csv"
# dir_ls does not return any matching files
fs::dir_ls(path = path, regexp = pattern)
# character(0)
I think the issue here is that the scope of each method's pattern argument differs. The list.files() pattern is only applied to the basename() of the file path, whereas, the dir_ls() regexp argument is being applied to the full path. As a result, the ^ regex is being applied to the start of the path, instead of the beginning of each file. Is there a way to limit the scope of dir_ls() to only match patterns on the basename() of each file similar to list.files()? Any other insights are appreciated!
See this issue on GitHub:
you need to modify your regular expression to match the full path then, or use a filtering function that only looks at the basename.
Use
pattern <- paste0(.Platform$file.sep, "ABC_.*\\.csv$")
You can also do something like
regexp = fs::path(path, pattern)
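The "filtering function that only looks at the basename" idea can be sketched in base R without depending on fs. This is only an illustration on a hand-written vector of paths (the paths themselves are made up); the same filter could be applied to whatever character vector of paths a listing function returns.

```r
# Hypothetical paths, standing in for the result of a directory listing.
paths <- c("path/to/files/ABC_1312.csv",
           "path/to/files/ABC_ACAB.csv",
           "path/to/files/XABC_1.csv",
           "ABC_dir/other.csv")
pattern <- "^ABC_.*\\.csv$"

# Matching against the full path anchors ^ at the start of the path.
full_path_hits <- paths[grepl(pattern, paths)]

# Matching against basename() anchors ^ at the start of the file name,
# which is what list.files() does with its pattern argument.
basename_hits <- paths[grepl(pattern, basename(paths))]

print(full_path_hits)
print(basename_hits)
```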

A copying nightmare, choosing files to copy based on files in another folder

I have a bit of an issue with using file.copy.
I need to copy .tif files from a directory with several subdirectories (where the .tif files are) based on names of those in another file directory. I have the following code (which is almost working)
ValidatedDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Validated"
RawDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Raw"
OutputDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Ouputfolder"
ValidatedImages <- list.files(ValidatedDirectory)
# this is to remove the extra bit that is added onto the validated images [working]
pattern <- gsub("_hc", "", ValidatedImages)
pattern <- paste(gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", pattern), collapse="|")
# this bit tackles finding the relevant files based on the ValidatedImages
filesinRAW <- list.files(
path = RawDirectory,
recursive = TRUE,
include.dirs = FALSE,
full.names = FALSE)
filesinRAW <- as.list(filesinRAW)
# this removes subdirectory prefix in front of the file and .tif which confuses it
filesinRAW <- as.list(gsub("\\d\\d\\d\\d/", "", filesinRAW))
filesinRaw <- as.list(gsub(".tif", "", filesinRAW))
tocopy <- grep(filesinRAW, pattern = pattern, value = TRUE)
tocopy <- as.list(tocopy)
tocopy <- as.list(gsub(".tif", "", tocopy))
setwd(RawDirectory)
file.copy(from = tocopy, to = OutputDirectory, overwrite = TRUE)
I get the No such file or directory error, the files do exist (obviously), thus I must be doing something wrong with the naming.
I have been having a bash at it for a good while, if helpful I can upload the example data and share the link.
Thanks for any help community!
When debugging, try to break down your code to see whether your variables are still what you expect at each step.
That said, I see several problems in your code right now:
grep works with pattern being a length-one regular expression. If you give it multiple regular expressions, it uses the first one (with a warning, which you don't see if you've disabled them).
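To match against several patterns, you can combine sapply and apply: filesinRAW[apply(sapply(pattern, grepl, x=filesinRAW), 1, any)] (each row of the sapply result corresponds to one file, so any is applied over rows). But see the last point.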
grep by default treats pattern as a regular expression, which may break things if your pattern contains regex metacharacters. For example, grep('^test', '^test') gives zero results. To check whether a string contains a literal string, you can use grep(..., fixed=TRUE).
In the last step, you use gsub(".tif", "", tocopy), which removes the .tif extension (and, because the dot is unescaped, any other character followed by tif). I suppose you meant to add .tif again at the end; right now you are trying to copy files without an extension, which won't be found. To add it back, you can use paste0.
In several steps you use as.list. Why? In R, everything is vectorised, meaning multiple values are already used. The difference between a list and a vector is that lists can store different kinds of objects, but you're not doing that anyway. As far as I can see, the as.lists don't harm anything, because all the functions will as a first step convert your list back to a character-vector.
Finally, as far as I can see, you first build a list of filenames that need to be copied (pattern), which you then compare to the full list of your files, and you try to make them match exactly. Then why use a regular expression? Regular expressions are useful if you only know part of what your filenames look like, but is that your goal? E.g. if filename1._hc is in your ValidatedDirectory, do the files filename11.tif and filename12.tif need to be copied as well?
If you're just looking for exact matches, you can directly compare them:
tocopy <- tocopy[tocopy %in% pattern]
But generally, working in R is easy because you can do everything step-by-step, and if you just inspect tocopy, you can see whether your call makes sense.
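The exact-match approach can be sketched end to end on made-up file names (the img... names and the year subfolders below are assumptions, not the asker's data). Comparing on basename() keeps the subdirectory prefix and the .tif extension intact, so the result can be passed to file.copy as is.

```r
# Hypothetical listings standing in for list.files() output.
validated <- c("img001_hc.tif", "img003_hc.tif")
raw <- c("2019/img001.tif", "2019/img002.tif",
         "2020/img003.tif", "2020/img0010.tif")

# Strip the "_hc" suffix as a literal string, not as a regular expression.
wanted <- sub("_hc", "", validated, fixed = TRUE)

# Keep the raw paths whose file name is exactly one of the wanted names.
tocopy <- raw[basename(raw) %in% wanted]
print(tocopy)

# Side note on fixed = TRUE: "^" is an anchor in a regex, a plain character otherwise.
regex_match <- grepl("^test", "^test")
literal_match <- grepl("^test", "^test", fixed = TRUE)
```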
After much help from @Emil Bode I have the following solution to the issue (perhaps not the most elegant, but it runs quickly enough on 1000s of .tif files).
ValidatedDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Validated"
RawDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Raw"
OutputDirectory <- "C:/Users/JS22/Desktop/R_Experiments/Raw_Folder_Testa/Ouputfolder"
ValidatedImages <- list.files(ValidatedDirectory)
pattern <- gsub("_hc", "", ValidatedImages)
pattern <- paste(gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", pattern), collapse="|")
filesinRAW <- list.files(
path = RawDirectory,
recursive = TRUE,
include.dirs = FALSE,
full.names = FALSE,
pattern = pattern)
setwd(RawDirectory)
file.copy(from = filesinRAW, to = OutputDirectory, overwrite = TRUE)

R - Find the location of the file

Below is the wrapper function I created to find the file location. The function works, but I would like to know if there is any simpler solution than this.
The purpose of this function is to find the folder of the file. Since list.files returns the directory and the file name, I can't use this as an input for setwd().
setwd(list.files(fileName)) will not work
Questions:
Is there any function which will give the folder so I don't have to create a wrapper function?
How can I find the last "/" in a string. I played with regexpr("\\\[^\\.]*$", Dir) and kept getting error.
Any answers or feedback are greatly appreciated.
Code:
findFileLocation <- function(FileName,...) {
#Find the location of the file
Dir <- list.files(pattern = FileName, recursive = TRUE)
#> Dir
#[1] "10-30/No time line/folderNames.csv"
positionOfDot <- regexpr("\\.[^\\.]*$", Dir)
#> positionOfDot
#[1] 18
numCharFile <- nchar(FileName)
#> numCharFile
#[1] 15
numCharDir <- nchar(Dir)
#> numCharDir
#[1] 21
fileDir <- substr(Dir, 1, (numCharDir-(numCharFile+1))) #+1 is to account for the "/"
fileDir #returns the actual location of the file
}
test <- findFileLocation("folderNames.csv")
from here I can execute the code:
setwd(file.path(mainDir, test))
Note: basename and dirname are already tried.
Thanks to @MrFlick. The answer is dirname(list.files(pattern = FileName, recursive = TRUE))
Since the first question was already answered, let me answer the second question here:
How can I find the last "/" in a string. I played with regexpr("\\\[^\\.]*$", Dir) and kept getting error.
The error message I get when I try to use this regular expression is:
Error: '\[' is an unrecognized escape in character string starting ""\\\["
The problem reported here is that a third backslash is used (\) where in fact a forward slash (/) was intended. Using regexpr("\\/[^\\.]*$", Dir) instead doesn't throw any errors. However, it doesn't do what was intended, i.e. it does not find the last forward slash. This is because this regular expression searches for forward slashes that are not followed by any dots (.), where in fact the idea was to search for forward slashes that are not followed by any (more) forward slashes.
Thus, the correct regular expression for the described use case is regexpr("\\/[^\\/]*$", Dir).
Dir <- "10-30/No time line/folderNames.csv"
regexpr("\\/[^\\/]*$", Dir)
# returns 19
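To tie the two questions together, the position of the last slash can be used to cut off the folder part, and the result can be compared with dirname(). A short sketch, using the same Dir value as above (last_slash and folder are illustrative names); the slash is written unescaped here, since / has no special meaning in R's regular expressions.

```r
Dir <- "10-30/No time line/folderNames.csv"

# Position of the last "/": a slash followed only by non-slash characters.
last_slash <- as.integer(regexpr("/[^/]*$", Dir))

# Everything before that position is the folder.
folder <- substr(Dir, 1, last_slash - 1)

print(last_slash)
print(folder)
print(identical(folder, dirname(Dir)))
```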

Find files with R console

There is the list.files function, which has a pattern argument. However, how can I find files in a folder whose names have the ending yto and contain 111 (it has to be before the dot)?
As @Roland writes, you can use the list.files function. Run ?list.files for its documentation.
Say we wish to list all .txt files in a folder (and its subfolders). Then something like
list.files(pattern = "\\.txt$", recursive = TRUE)
would do the trick. The recursive argument makes the function search subfolders as well. The pattern we're looking for is the regular expression "\\.txt$", meaning that the filename should end with .txt. Consult ?regex for more information on regular expressions.
EDIT: If you search for files which end in 111.tx, you then need to modify the above to:
list.files(pattern = "111\\.tx$", recursive = TRUE)
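For the case in the question (extension yto, with 111 somewhere in the name before the dot), a possible pattern is sketched below. It assumes that "before the dot" means before the final dot of the extension and that 111 need not sit directly next to it; the file names are invented, and grepl() on a character vector is used so the pattern can be checked without touching the file system.

```r
# Invented file names to try the pattern on.
files <- c("a111b.yto", "111.yto", "a11.yto",
           "abc.111", "a111.b.yto", "x111.ytox")

# 111, then any characters that are not dots, then a literal ".yto" at the end.
pattern <- "111[^.]*\\.yto$"

hits <- files[grepl(pattern, files)]
print(hits)
```

The same pattern string could then be passed as the pattern argument of list.files.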
