I'm trying to get a list of subdirectories from a path. These subdirectories follow a time pattern month\day\hour, e.g. 03\21\11.
I naively used the following:
list.files("path",pattern="[0-9]\[0-9]\[0-9]", recursive = TRUE, include.dirs = TRUE)
But it doesn't work.
How can I code for the digitdigit\digitdigit\digitdigit pattern here?
Thank you
This regex works for 10\11\18:
(\d\d\\\d\d\\\d\d)
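As an R string literal, each backslash in that pattern has to be doubled again, because the string parser consumes one level of escaping. A quick check:
# "\\\\" in the string becomes \\ in the regex, which matches one literal backslash
grepl("\\d\\d\\\\\\d\\d\\\\\\d\\d", "10\\11\\18")
# [1] TRUE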
I think you may need lazy matching in the regex, unless there are always two digits, in which case the other responses look valid.
If you could provide a vector of file name strings, that would be super helpful.
Matching backslashes is confusing; I've found this thread helpful: R - gsub replacing backslashes
My guess is something like this: '[0-9]+?\\\\[0-9]+?\\\\[0-9]+'
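Putting that together, a minimal sketch ("path" is a placeholder). Note that list.files() matches pattern against file names only, not full paths, and returns forward-slash separators even on Windows, so it is more reliable to list everything and filter the relative paths afterwards:
# list all files and directories, then keep paths ending in a dd/dd/dd triple
all_entries <- list.files("path", recursive = TRUE, include.dirs = TRUE)
all_entries[grepl("[0-9]{2}/[0-9]{2}/[0-9]{2}$", all_entries)]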
I'm trying to work out a way of extracting text files from multiple directories using fs::dir_ls and vroom.
The directory structure is essentially M:/instrument/project/experiment/measurements/time_stamp/raw_data/files*.txt.
Ideally, I want to be able to define the path to the experiment level then let the pattern take care of the rest, for example -
fs::dir_ls(path = "M:/instrument/project/", glob = "experiment_*/2021-04-11*/raw_data/files*.txt", invert = TRUE, recurse = TRUE)
So I'm reading in all the .txt files across multiple experiment directories in one go. However, when I try this approach, it returns all the files from the project level rather than those from the specific folders described by the pattern.
I've looked through the other SO questions on the topic covered here: Pattern matching using a wildcard, R list files with multiple conditions, list.files pattern argument in R, extended regular expression use, and grep using a character vector with multiple patterns, but haven't been able to apply them to my particular problem.
Any help is appreciated. I realise the answer is likely staring me in the face; I just need help seeing it.
Thanks
You can try the following with list.files. Note that pattern takes a regular expression, not a glob, so 'arpe19*' would mean "arpe1 followed by any number of 9s":
files <- list.files('M:/Operetta/LED_Wound/operetta_export/plate_variability[540]/robot_seed_wide_plate_1[1614]/2021-05-10T113438+0100[1764]/SC_data', pattern = 'arpe19.*\\.txt$')
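For the fs::dir_ls attempt in the question, two details are worth checking: invert = TRUE returns the paths that do not match the glob (which would explain getting everything back), and the glob is matched against the full path that dir_ls returns, so it needs a leading wildcard. A sketch using the question's placeholder paths:
library(fs)
# invert is left at its default (FALSE) so matching paths are kept;
# the leading * lets the glob match the full "M:/instrument/project/..." path
txt_files <- dir_ls(
  path = "M:/instrument/project/",
  glob = "*/experiment_*/2021-04-11*/raw_data/files*.txt",
  recurse = TRUE
)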
I'm trying to create a regular expression for when a pattern doesn't occur. Specifically, I want to pull a list of folders and sub-folders from a drive, so I'm looking for anything that doesn't end in \\.[[:alnum:]]{1,4}$. Because this pattern goes into list.files, I can't use Perl-like lookahead (right?). Is there a way to achieve this other than first putting everything into a vector and then running grep on it with lookahead?
I'm not too familiar with R's regex, but this seems to work for me:
'.*[^[:alnum:]].{0,3}$'
What it means is that at least one of the last four characters must not be alphanumeric.
files <- c("my_file", "script.php", "foo!faa", "test123.321tset", "colors.red")
files[grep(".*[^[:alnum:]].{0,3}$", files)]
# => "script.php" "foo!faa" "colors.red"
OK, this was stupid. The answer was staring me in the face the whole time: list.dirs lists only directories, while list.files lists all files. I'm not sure why trying the former at first didn't seem to give me the result I was looking for...
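For the record, a minimal sketch ("path" is a placeholder):
# list.dirs() returns directories only, so no extension-based filtering is needed
dirs <- list.dirs("path", full.names = TRUE, recursive = TRUE)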
Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex seems fine to me. I don't think you should use grep here, though, and you are missing perl = TRUE; that is why you are getting the error.
I would recommend using:
stringr::str_extract(time, "\\d+?(?=:)")
grep works a little differently from how it is used here: it is good for matching whole values and filtering out those that share a pattern, but you can't pluck values out from within a string with grep.
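To see the difference: with perl = TRUE the pattern is valid, but grep still returns the whole matching element rather than the captured part:
grep(".+?(?=:)", time, value = TRUE, perl = TRUE)
# [1] "12:05:41"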
If you want to use base R, you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$", "\\1", time, perl = TRUE)
Also, you may split the string on the colon using strsplit and take the first piece, like below:
strsplit(time, ":")[[1]][1]
I have strings that look as shown below. I need to extract the part of the string that is between the first // and the first subsequent /. I use gsub with perl = FALSE, but it's roughly 4 times slower than with perl = TRUE. So I tried perl = TRUE and found that the search starts from the END of the string??
a = "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"
print(gsub(".*//(.*?)/.*","\\1",a))
"moo.com"
print(gsub(".*//(.*?)/.*","\\1",a,perl=T))
"ghhg"
moo.com is what I need. I am very surprised to see this: is it documented somewhere? How can I rewrite it with perl? I have 20M rows to work with, and speed is important. Thanks!
Edit: it is not given that every string will start with http
You can try .*?//(.*?)/.* to make the first .* lazy as well, so that // matches at the first // instance:
gsub(".*?//(.*?)/.*","\\1",a,perl=T)
# [1] "moo.com"
And ?gsub says:
The standard regular-expression code has been reported to be very slow when applied to extremely long character strings (tens of thousands of characters or more): the code used when perl = TRUE seems much faster and more reliable for such usages.
The standard version of gsub does not substitute correctly repeated word-boundaries (e.g. pattern = "\b"). Use perl = TRUE for such matches.
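Given the 20M rows, it may also be worth extracting the match rather than rewriting each whole string. A sketch using regexpr()/regmatches() with a lookbehind (this assumes every string contains a //; strings without one are dropped from the result):
a <- "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"
# regexpr() reports only the first match, so the lookbehind anchors on the
# first // in the string; [^/]+ stops at the next /
m <- regexpr("(?<=//)[^/]+", a, perl = TRUE)
regmatches(a, m)
# [1] "moo.com"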
As part of a larger task performed in R under Windows, I would like to copy selected files between directories. Is it possible to give, within R, a command like cp patha/filea*.csv pathb (notice the wildcard, for extra spice)?
I don't think there is a direct way (short of shelling out), but something like the following usually works for me.
flist <- list.files("patha", "^filea.+[.]csv$", full.names = TRUE)
file.copy(flist, "pathb")
Notes:
I purposely decomposed this into two steps; they can be combined.
Note the regular expression: list.files takes a true regex, and the path and the file-name pattern go in two separate arguments.
Note the ^ and $ (beginning/end of string) anchors in the regex. This is a common gotcha: they are implicit in wildcard-style patterns but must be written explicitly in a regex, otherwise file names that match the pattern but start and/or end with additional text would be selected as well.
In the Windows world, people will typically add the ignore.case = TRUE argument to list.files, in order to emulate the fact that directory searches are case-insensitive on this OS.
R's glob2rx() function provides a convenient way to convert wildcard patterns to regular expressions. For example, fpattern = glob2rx('filea*.csv') returns a different but equivalent regex; see the sketch below.
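A sketch of the same copy using glob2rx() (paths are the question's placeholders):
# glob2rx() translates the wildcard into an anchored regex, e.g. "^filea.*\\.csv$"
flist <- list.files("patha", glob2rx("filea*.csv"), full.names = TRUE)
file.copy(flist, "pathb")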
You can
use system() to fire off a command as if it were typed at the shell, including the globbing
use list.files() aka dir() to do the globbing / regexp matching yourself and then copy the files individually
use file.copy on individual files as shown in mjv's answer
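A sketch of the first option. On Windows builds of R, shell() runs the command through cmd.exe, whose copy command expands the wildcard itself; on unix-alikes, system() with cp is the equivalent:
# Windows (cmd.exe does the globbing):
shell("copy patha\\filea*.csv pathb")
# unix-alikes:
# system("cp patha/filea*.csv pathb")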