Use R fs::dir_ls to match the beginning of file name?

I'm trying to use fs::dir_ls() to return the same results as the list.files() example below. Ultimately, I'm just trying to return files that start with a specific pattern.
path <- "./path/to/files"
pattern <- "^ABC_.*\\.csv$"
# list files returns the expected output
list.files(path = path, pattern = pattern, full.names = TRUE)
# [1] "path/to/files/ABC_1312.csv"
# [2] "path/to/files/ABC_ACAB.csv"
# dir_ls does not return any matching files
fs::dir_ls(path = path, regexp = pattern)
# character(0)
I think the issue here is that the scope of each method's pattern argument differs. The list.files() pattern is only applied to the basename() of the file path, whereas the dir_ls() regexp argument is applied to the full path. As a result, the ^ anchor matches the start of the path instead of the start of each file name. Is there a way to limit the scope of dir_ls() to only match patterns on the basename() of each file, similar to list.files()? Any other insights are appreciated!

See this issue on GitHub:
you need to modify your regular expression to match the full path then, or use a filtering function that only looks at the basename.
Use
pattern <- paste0(.Platform$file.sep, "ABC_.*\\.csv$")
You can also do something like
regexp = fs::path(path, pattern)
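If you would rather keep the basename-only pattern from the question, another option is to list everything and filter afterwards, as the quoted comment suggests. A minimal sketch, assuming the fs package is installed; the temporary directory and the file names are made up for illustration:

```r
# Hypothetical throwaway directory; the file names are made up for illustration.
path <- file.path(tempdir(), "dir_ls_demo")
dir.create(path, showWarnings = FALSE)
invisible(file.create(file.path(path, c("ABC_1312.csv", "ABC_ACAB.csv", "XYZ_ABC_1.csv"))))

pattern <- "^ABC_.*\\.csv$"

# List everything first, then apply the regex to the basename only,
# which mirrors how list.files() treats its pattern argument.
all_files <- fs::dir_ls(path)
matched <- all_files[grepl(pattern, basename(all_files))]
basename(matched)
```

Here XYZ_ABC_1.csv is dropped because ^ is now anchored to the start of the file name rather than the start of the path.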

Related

List files that end with pattern and lack an extension

I have a directory with multiple subdirectories that contain files.
The files themselves have no extension; however, each file has an additional header file with the extension ".hdr".
In R, I want to list all file names that contain the string map_masked and end with the pattern "masked", but I only want the files without an extension (the ones that end with the pattern, not the header files).
As suggested in this answer, I tried to use the $ sign to indicate the pattern should occur at the end of a line.
This is the code I used:
dir <- "/my/directory"
list.files(dir, pattern = "map_masked|masked$", recursive = TRUE)
The output, however, looks as follows:
[1] "subdirectory/something_map_masked_something_masked"
[2] "subdirectory/something_map_masked_something_masked.hdr"
etc.
Now, how do I tell R to exclude the files that have an ".hdr" extension?
I am aware this could easily be done by applying a filter on the output, but I would rather like to know what is wrong with my code and understand why R behaves the way it does in this case.
You can use
list.files(dir, pattern = "map_masked.*masked$", recursive = TRUE)
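It returns file paths that contain map_masked and end with masked. The original pattern used | (alternation), which matches names that either contain map_masked or end with masked, so the .hdr files matched through the first branch.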
Details:
map_masked - a fixed string
.* - any zero or more chars as many as possible
masked - the literal string masked
$ - end of string.
See the regex demo.
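The difference between the two patterns can be checked directly with grepl(). A small sketch using the two file names from the question's output (without the subdirectory prefix, since list.files() matches the pattern against file names):

```r
files <- c("something_map_masked_something_masked",
           "something_map_masked_something_masked.hdr")

# Alternation: "contains map_masked" OR "ends with masked".
# Both names contain map_masked, so both match.
with_alternation <- grepl("map_masked|masked$", files)

# Sequence: map_masked, then anything, then masked at the very end.
# The .hdr name does not end with masked, so it is rejected.
with_sequence <- grepl("map_masked.*masked$", files)

with_alternation  # TRUE TRUE
with_sequence     # TRUE FALSE
```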

R: How to match a forward-slash in a regular expression?

How do I match on a forward slash / in a regular expression in R?
As demonstrated in the example below, I am trying to search for .csv files in a subdirectory and my attempts to use a literal / are failing. Looking for a modification to my regex in base R, not a function that does this for me.
Example subdirectory
# Create subdirectory in current working directory with two .csv files
# - remember to delete these later or they'll stay in your current working directory!
dir.create(path = "example")
write.csv(data.frame(x1 = letters), file = "example/example1.csv")
write.csv(data.frame(x2 = 1:20), file = "example/example2.csv")
Get relative paths of all .csv files in the example subdirectory
# This works for the example, but could mistakenly return paths to other files based on:
# (a) file name: foo/example1.csv
# (b) subdirectory name: example_wrong/foo.csv
list.files(pattern = "example.*csv", recursive = TRUE)
#> [1] "example/example1.csv" "example/example2.csv"
# This fixes issue (a) but doesn't fix issue (b)
list.files(pattern = "^example.*?\\.csv$", recursive = TRUE)
#> [1] "example/example1.csv" "example/example2.csv"
# Adding / to the end of `example` guarantees we get the correct subdirectory
# Doesn't work: / is special regex and not escaped
list.files(pattern = "^example/.*?\\.csv$", recursive = TRUE)
# Doesn't work: escapes / but throws error
list.files(pattern = "^example\/.*?\\.csv$", recursive = TRUE)
# Doesn't work: even with the \\ escaping in R!
list.files(pattern = "^example\\/.*?\\.csv$", recursive = TRUE)
Some of the solutions above work with regex tools but not in R. I've checked SO for solutions (most related below) but none seem to apply:
Escaping a forward slash in a regular expression
Regex string does not start or end (or both) with forward slash
Reading multiple csv files from a folder with R using regex
The pattern argument is only used for matching file (or directory) names, not the full path they are on (even when recursive and full.names are set to TRUE). That's why your last approach doesn't work even though it is the correct way to match / in a regular expression. You can get the correct file names by specifying path and setting full.names to TRUE.
list.files(path = "example", pattern = "\\.csv$", full.names = TRUE)
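If you do want a regex that includes the /, one workaround is to list recursively without a pattern and then filter the returned relative paths yourself. A minimal sketch in base R; the temporary directory layout is made up to mirror the question's example:

```r
# Hypothetical layout: example/ holds the wanted files, example_wrong/ is a decoy.
root <- file.path(tempdir(), "slash_demo")
dir.create(file.path(root, "example"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "example_wrong"), showWarnings = FALSE)
invisible(file.create(file.path(root, "example", c("example1.csv", "example2.csv"))))
invisible(file.create(file.path(root, "example_wrong", "foo.csv")))

# pattern only ever sees "example1.csv", "foo.csv", ..., so a "/" can never match.
no_match <- list.files(root, pattern = "^example/.*\\.csv$", recursive = TRUE)

# Filtering the relative paths afterwards lets the same regex see the "/".
all_paths <- list.files(root, recursive = TRUE)
in_example <- all_paths[grepl("^example/.*\\.csv$", all_paths)]
in_example
```

Note that / is not a special character in R regular expressions, so it needs no escaping here.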

Error in file(con, "r") : cannot open the connection when using lapply

I have a folder with about 100 txt files. I only run this simple code:
> setwd("E:/Yunlin/SMUNPO/TXTFILE/")
> filenames <- list.files(getwd(),pattern="*.txt")
> textfiles <- lapply(filenames, readLines)
However, the result is Error in file(con, "r") : cannot open the connection. I tried setting the working directory and simplifying the file names, but neither works. When I test readLines on a specific file name, it works, but not for the whole folder. Can anyone help? Thank you in advance.
You must use regex-style patterns in pattern, not glob-style.
From ?list.files:
pattern: an optional regular expression. Only file names which match
the regular expression will be returned.
So it is expecting regex, not glob-style patterns.
Use one of these options:
list.files(pattern = "\\.txt$")
list.files(pattern = utils::glob2rx("*.txt"))
(To learn regex, I suggest both https://stackoverflow.com/a/22944075/3358272 and https://www.regular-expressions.info/. Note that backslashes in regular expressions usually need to be double backslashes; for example, \b (word boundary) in R needs to be \\b.)
You should use full.names=TRUE, precluding the need for the setwd/getwd dance. I suggest something like:
# no need for `setwd`
filenames <- list.files("E:/Yunlin/SMUNPO/TXTFILE/", pattern = "\\.txt$", full.names = TRUE)
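To see the pieces working together, here is a small self-contained sketch in base R. The temporary folder and its files are made up for illustration and stand in for the real E:/ folder:

```r
# Hypothetical stand-in for the real folder.
dir_txt <- file.path(tempdir(), "txt_demo")
dir.create(dir_txt, showWarnings = FALSE)
writeLines(c("a", "b"), file.path(dir_txt, "one.txt"))
writeLines("c", file.path(dir_txt, "two.txt"))
writeLines("x", file.path(dir_txt, "notes.txt.bak"))  # should not be picked up

# glob2rx() translates a glob into the equivalent regular expression.
rx <- utils::glob2rx("*.txt")
rx  # "^.*\\.txt$"

# full.names = TRUE returns paths that readLines() can open from anywhere.
filenames <- list.files(dir_txt, pattern = "\\.txt$", full.names = TRUE)
textfiles <- lapply(filenames, readLines)
names(textfiles) <- basename(filenames)
lengths(textfiles)
```

Naming the list by basename(filenames) is optional, but it makes it easier to tell which element came from which file.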

R - Find the location of the file

Below is the wrapper function I created to find the file location. The function works, but I would like to know if there is any simpler solution than this.
The purpose of this function is to find the folder of the file. Since list.files returns the directory and the file name, I can't use this as an input for setwd().
setwd(list.files(fileName)) will not work.
Questions:
Is there any function which will give the folder so I don't have to create a wrapper function?
How can I find the last "/" in a string? I played with regexpr("\\\[^\\.]*$", Dir) and kept getting an error.
Any answers or feedback are greatly appreciated.
Code:
findFileLocation <- function(FileName,...) {
  #Find the location of the file
  Dir <- list.files(pattern = FileName, recursive = TRUE)
  #> Dir
  #[1] "10-30/No time line/folderNames.csv"
  positionOfDot <- regexpr("\\.[^\\.]*$", Dir)
  #> positionOfDot
  #[1] 18
  numCharFile <- nchar(FileName)
  #> numCharFile
  #[1] 15
  numCharDir <- nchar(Dir)
  #> numCharDir
  #[1] 21
  fileDir <- substr(Dir, 1, (numCharDir-(numCharFile+1))) #+1 is to account for the "/"
  fileDir #returns the actual location of the file
}
test <- findFileLocation("folderNames.csv")
From here I can execute the code:
setwd(file.path(mainDir, test))
Note: I have already tried basename and dirname.
Thanks to @MrFlick. The answer is dirname(list.files(pattern = FileName, recursive = TRUE))
Since the first question was already answered, let me answer the second question here:
How can I find the last "/" in a string. I played with regexpr("\\\[^\\.]*$", Dir) and kept getting error.
The error message I get when I try to use this regular expression is:
Error: '[' is an unrecognized escape in character string starting ""\["
The problem reported here is that a third backslash is used (\) where in fact a forward slash (/) was intended. Using regexpr("\\/[^\\.]*$", Dir) instead doesn't throw any errors. However, it doesn't do what was intended, i.e. it does not find the last forward slash. This is because this regular expression searches for forward slashes that are not followed by any dots (.), where in fact the idea was to search for forward slashes that are not followed by any (more) forward slashes.
Thus, the correct regular expression for the described use case is regexpr("\\/[^\\/]*$", Dir).
Dir <- "10-30/No time line/folderNames.csv"
regexpr("\\/[^\\/]*$", Dir)
# returns 19
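Both answers can be checked on the example path from the question. A short sketch in base R; since / is not special in R regular expressions, the escaping is left out here:

```r
Dir <- "10-30/No time line/folderNames.csv"

# Answer to the first question: dirname() strips the file name.
folder <- dirname(Dir)
folder  # "10-30/No time line"

# Answer to the second question: position of the last "/".
last_slash <- as.integer(regexpr("/[^/]*$", Dir))
last_slash  # 19

# Cutting the string just before the last slash gives the same folder.
folder_by_regex <- substr(Dir, 1, last_slash - 1)
```

The as.integer() call only drops the match.length attributes that regexpr() attaches to its result.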

Find files with R console

There is the list.files function, which takes a pattern argument. How can I find files in a folder whose names end in yto and contain 111 (it has to be before the dot)?
As @Roland writes, you can use the list.files function. Run ?list.files for its documentation.
Say we wish to list all .txt files in a folder (and its subfolders). Then something like
list.files(pattern = "\\.txt$", recursive = TRUE)
would do the trick. The recursive argument makes the function search subfolders as well. The pattern is the regular expression "\\.txt$", meaning that the file name should end with .txt. Consult ?regex for more information on regular expressions.
EDIT: If you are searching for files which end in 111.tx, you need to modify the above to:
list.files(pattern = "111\\.tx$", recursive = TRUE)
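For the case described in the question, a pattern along the following lines might work. This is only a sketch, assuming "ending yto" means a .yto extension and that 111 may appear anywhere before that extension; the file names are made up for illustration:

```r
# Made-up file names for illustration.
files <- c("a111b.yto", "111.yto", "abc.yto", "a111.txt", "abc.111yto")

# 111, then anything, then a literal ".yto" at the end of the name.
pattern <- "111.*\\.yto$"
hits <- files[grepl(pattern, files)]
hits  # "a111b.yto" "111.yto"
```

The same pattern can then be passed on as list.files(pattern = pattern, recursive = TRUE).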
