Get a list of files with exceptions using pattern - r

Using only the pattern argument of the list.files() function, how can I get a file list that excludes some files with a similar pattern?
Let's say I have these files in my working directory:
med_t_1_1.csv, 01_t_1_1.csv, 02_t_1_1.csv, 03_t_1_1.csv,
med_t_2_1.csv, 01_t_2_1.csv, 02_t_2_1.csv, 03_t_2_1.csv
I want to get the files matching the pattern t_1_1, except the one that starts with med:
01_t_1_1.csv, 02_t_1_1.csv, 03_t_1_1.csv

file_chrs <- c("med_t_1_1.csv", "01_t_1_1.csv", "02_t_1_1.csv", "03_t_1_1.csv",
"med_t_2_1.csv", "01_t_2_1.csv", "02_t_2_1.csv", "03_t_2_1.csv")
file_chrs[grepl("\\d_t_1_1", file_chrs)] # \\d matches any digit [0-9]
# console
[1] "01_t_1_1.csv" "02_t_1_1.csv" "03_t_1_1.csv"
# so in your working directory
list.files( pattern = "\\d_t_1_1" )

You can use a regular expression with a negative lookbehind (here S is the character vector of file names):
S[grepl("(?<!med_)t_1_1", S, perl=TRUE)]
# "01_t_1_1.csv" "02_t_1_1.csv" "03_t_1_1.csv"
Explanation of the regex:
(?<!med_) is a negative lookbehind: (?< looks behind, ! negates the match, and med_ is the string.
It asserts that the text immediately before the current position is not med_.
t_1_1 is a literal string.
It matches any occurrence of t_1_1.
Another example:
S1 <- c("med_t_1_1.csv", "S_t_1_1.csv", "04_t_1_1.csv")
S1[grepl("(?<!med_)t_1_1", S1, perl=TRUE)]
# "S_t_1_1.csv" "04_t_1_1.csv"
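Note that list.files() has no perl argument, so a lookbehind pattern is best applied to its result with grep() rather than passed as pattern. A minimal sketch, assuming a throwaway directory under tempdir() populated with the file names from the question (tmp_dir, files, by_digit and by_lookbehind are illustrative names only):

```r
# Create a scratch directory holding empty files with the example names.
tmp_dir <- file.path(tempdir(), "pattern_demo")
dir.create(tmp_dir, showWarnings = FALSE)
files <- c("med_t_1_1.csv", "01_t_1_1.csv", "02_t_1_1.csv", "03_t_1_1.csv",
           "med_t_2_1.csv", "01_t_2_1.csv", "02_t_2_1.csv", "03_t_2_1.csv")
invisible(file.create(file.path(tmp_dir, files)))

# Option 1: a plain pattern that list.files() handles by itself.
by_digit <- list.files(tmp_dir, pattern = "\\d_t_1_1")

# Option 2: list everything, then filter with a PCRE negative lookbehind.
by_lookbehind <- grep("(?<!med_)t_1_1", list.files(tmp_dir),
                      perl = TRUE, value = TRUE)
```

Both by_digit and by_lookbehind should then hold the three numbered t_1_1 files. The second option is the more general one, since it also keeps names such as S_t_1_1.csv that do not start with a digit.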

Related

Extract string with digits and special characters in r

I have a list of filenames in the format "filename PID00-00-00" or just "PID00-00-00".
I want to extract part of the filename to create an ID column.
I am currently using this code for the string extraction
names(df) <- stringr::str_extract(names(df), "(?<=PID)\\d+")
binded1 <- rbindlist(df, idcol = "ID") %>%
  as.data.frame()
This gives the ID as the first set of digits after PID. e.g. filename PID1234-00-01 becomes ID 1234.
I want to also extract the first hyphen and following digits. So from filename PID1234-00-01 I want 1234-00.
What should my regex be?
Try this:
stringr::str_extract(names(df),"(?<=PID)\\d{4}-\\d{2}")
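If the number of digits on either side of the hyphen can vary, a more permissive pattern may be useful. A minimal base R sketch, assuming every name contains PID followed by digits, a hyphen, and more digits (the sample names in file_names are made up for illustration):

```r
# Hypothetical sample names; the second one has fewer digits per group.
file_names <- c("filename PID1234-00-01", "PID98-7-01")

# Lookbehind for "PID", then digits, one hyphen, and digits again.
m <- regexpr("(?<=PID)\\d+-\\d+", file_names, perl = TRUE)
ids <- regmatches(file_names, m)
```

Keep in mind that regmatches() silently drops elements without a match, so ids can be shorter than file_names if some names do not follow the assumed format.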

replace element of string 3 - 10 positions after a pattern

Ideally, in base R I need some kind of string manipulation that will let me detect a pattern and change the string 3 positions after the pattern.
example <- "when string says SOMETHING = #c792ea"
desired output:
when string says SOMETHING = #001628
I have tried gsub but I am not sure how I can get it to replace the characters after a pattern.
If it is based on the position of the character, then we can use substring assignment:
substring(example, 30) <- "#001628"
example
#[1] "when string says SOMETHING = #001628"
Or if we need to find the position of the word that starts with #
library(stringr)
posvec <- c(str_locate(example, "#\\w+"))
substring(example, posvec[1], posvec[2]) <- "#001628"
# or with
# str_sub(example, posvec[1], posvec[2]) <- "#001628"
Another option is sub, which replaces everything after the = and zero or more spaces (\\s*):
sub("=\\s*.*", "= #001628", example)
#[1] "when string says SOMETHING = #001628"
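When the replacement should be tied to the preceding pattern rather than to a fixed position, a capture group can keep the pattern and swap only what follows it. A minimal base R sketch, assuming the value after SOMETHING = is always a six-digit hex colour:

```r
example <- "when string says SOMETHING = #c792ea"

# Group 1 keeps the pattern itself; only the hex colour after it is replaced.
result <- sub("(SOMETHING = )#[0-9A-Fa-f]{6}", "\\1#001628", example)
```

Unlike the =\\s*.* variant, this leaves any text that follows the colour untouched.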

RegEx for a conditional pattern in a string

I need to extract substrings from some strings. For example,
my data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
If the string starts with "PREDICTED:", I want to extract the word that follows (here "Ceratitis"); if the string does not start with "PREDICTED:", I want to extract the first word (here "Shigella").
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
This looks like a typical case for a conditional regular expression. I tried, but always failed.
R supports Perl-compatible regular expressions, so I tried to use regexpr and regmatches to extract the substrings that I want.
The code is:
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,b,perl = TRUE)
result_b = regmatches(b,m_b)
Finally, the result is:
# result_a = "Shigella"
# result_b = "PREDICTED"
This is not the result I expected: result_a is right, but result_b is wrong.
Why? It seems that the condition didn't work.
PS:
I tried to read up on conditional regular expressions at https://www.regular-expressions.info/conditional.html, modeled my pattern on the examples there, and also used the RegexBuddy software to look for the cause.
EDIT:
To use the function below on a vector, one can do:
myvec <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor <- function(string) {
  if (grepl("^PREDICTED", string)) {
    strsplit(string, ": ")[[1]][2]
  } else {
    strsplit(string, " ")[[1]][1]
  }
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"
I think the reason it does not work is that the condition (1) checks whether numbered capture group 1 has been set, but no capture group has been set at that point; the positive lookbehind (?<=PREDICTED:)? does not contain one either.
The first and second capturing groups only appear in the parts that follow. The if clause checks group 1, finds it unset, and so the else branch (group 2) is matched.
If you made the lookbehind contain the only capturing group, (?<=(PREDICTED: )?), and omitted the other two, the if clause would be true, but you would get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, to get both words you might use a capturing group and make PREDICTED: optional:
^(?:PREDICTED: )?(\w+)
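In R, the captured word can be pulled out with sub() and a backreference. A minimal sketch, assuming the same two input strings as in the question (inp and result are illustrative names):

```r
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")

# Match the whole string, optionally skipping "PREDICTED: ",
# and keep only the first captured word.
result <- sub("^(?:PREDICTED: )?(\\w+).*$", "\\1", inp, perl = TRUE)
```

Note that sub() returns the input unchanged when the pattern does not match, so strings that do not start with a word character would pass through as they are.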
If I understand correctly, the OP wants to extract
- the first word after "PREDICTED:" if the string starts with "PREDICTED:"
- the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
1. Remove the leading "PREDICTED:" (if any).
2. Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
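The same two-step idea can also be written without stringr. A minimal base R sketch, assuming inputs like those above (the extra leading spaces in the second element are added here only to exercise the trimming step):

```r
inp <- c("Shigella dysenteriae", "  PREDICTED: Ceratitis")

# Step 1: trim whitespace and drop a leading "PREDICTED:" if present.
stripped <- sub("^PREDICTED:\\s*", "", trimws(inp))

# Step 2: extract the first word of what remains.
first_word <- regmatches(stripped, regexpr("^\\w+", stripped))
```

One difference from str_extract() is that regmatches() drops elements without a match instead of returning NA for them.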

how to match the start and end of an expression with grep in R

I am trying to match the start and end of strings with the grep command, but I am not able to do so. For example, consider the following file names.
filenames <- c("S2abc.6h", "S2abc.4h", "S2abc.0h","S4abc.6h","S2xyz.6h")
I want to find all the files starting with S2 and ending with 6h. I can select the files starting with S2 using:
grep("S2", filenames, value = TRUE)
But I am not able to use wildcards with grep.
> grep("S2*6h", filenames, value = TRUE)
character(0)
I think your approach to use grep is fine, you only need to slightly tweak the regular expression you use to match the filenames you want.
> matches <- grep("^S2.*\\.6h$", filenames, ignore.case = T)
> matches
[1] 1 5
> filenames[matches]
[1] "S2abc.6h" "S2xyz.6h"
The regex I used is:
^S2.*\\.6h$
This will match any filename which begins with S2 and ends with .6h
You can use ^ to anchor the start and $ to anchor the end of a string. The .+ catches everything in between.
grep("^S2.+6h$", filenames, value = TRUE)
# [1] "S2abc.6h" "S2xyz.6h"
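The original attempt fails because, in a regular expression, * is a quantifier (zero or more of the preceding character), not a shell-style wildcard. If wildcard syntax is preferred, utils::glob2rx() converts a glob into an anchored regular expression. A minimal sketch using the file names from the question (rx and matched are illustrative names):

```r
filenames <- c("S2abc.6h", "S2abc.4h", "S2abc.0h", "S4abc.6h", "S2xyz.6h")

# Translate the shell-style wildcard into a regular expression.
rx <- utils::glob2rx("S2*6h")

# Use the translated pattern exactly like a hand-written regex.
matched <- grep(rx, filenames, value = TRUE)
```

This keeps the wildcard notation readable while still giving grep() the regular expression it expects.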

Text Mining in a string using R

I recently started using R and am a newbie at data analysis.
Is it possible in R to find the repetitions of a search string within a single main string?
Example:
Main string: 'abcdefghikllabcdefgllabcd'
and search string: 'lla'
Desired output: 'abcdefghik lla bcdefg lla bcd'
I tried using R's grep() function, but it is not working in the desired way and only gives the number of repetitions of the search string in multiple main strings.
Thank you in advance.
This works too using regex capture groups:
gsub("(lla)"," \\1 ","abcdefghikllabcdefgllabcd")
Try the gsub() method like this:
main_string <- 'abcdefghikllabcdefgllabcd'
search_string <- 'lla'
output_string <- gsub(search_string, paste(' ', search_string, ' ', sep = ''), main_string)
Your question says that you might want to just COUNT the number of occurrences of the search string in the main string. If that is the case, try this one-liner:
string = "abcdefghikllabcdefgllabcd"
search = 'lla'
( nchar(string) - nchar( gsub(search, "", string)) ) / nchar(search)
#returns 2
string2 = "llaabcdefghikllabcdefgllabcdlla"
( nchar(string2) - nchar( gsub(search, "", string2)) ) / nchar(search)
#returns 4
NOTE: Unit-test your solution for matches at the beginning and end of the string (i.e. make sure it works on 'llaabcdefghikllabcdefgllabcdlla'). I have seen several solutions elsewhere that rely on strsplit() to split on 'lla', but these solutions skip the final 'lla' at the end of the word.
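An alternative way to count is to locate the matches directly instead of comparing string lengths. A minimal sketch, where count_occurrences is a hypothetical helper name and fixed = TRUE is chosen so the search string is treated literally rather than as a regular expression:

```r
# Count non-overlapping literal occurrences of `search` in each element of `string`.
count_occurrences <- function(string, search) {
  hits <- gregexpr(search, string, fixed = TRUE)
  lengths(regmatches(string, hits))
}

n_main  <- count_occurrences("abcdefghikllabcdefgllabcd", "lla")
n_edges <- count_occurrences("llaabcdefghikllabcdefgllabcdlla", "lla")
n_none  <- count_occurrences("abcdef", "lla")
```

Matches at the beginning and end of the string are counted as well, and a string with no match gives 0.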
