I want to check whether "001", "100", or "000" occurs in a string of 4 characters made up of 0s and 1s. For example, a 4-character string could be "1100", "0010", "1001", or "1111". How do I match several patterns in a string with a single command?
I know grep can be used for pattern matching, but with grep I can check only one pattern at a time. I want to know whether multiple strings can be checked with some other command or with grep itself.
Yes, you can. The | in a grep pattern means "or", so you can test for your patterns by using "001|100|000" as the pattern. At the same time, grep is vectorised, so all of this can be done in one step:
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
grep(pattern, x)
[1] 1 2 3
This returns the indices of the elements of your vector that contain the pattern (in this case the first three).
Sometimes it is more convenient to have a logical vector that tells you which of the elements in your vector were matched. Then you can use grepl:
grepl(pattern, x)
[1] TRUE TRUE TRUE FALSE
See ?regex for help about regular expressions in R.
Edit:
To avoid creating the pattern manually, we can use paste:
myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")
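Continuing with the vector x from above, the built pattern and its reuse should give the same result as before:
pattern
## [1] "001|100|000"
grepl(pattern, x)
## [1]  TRUE  TRUE  TRUE FALSE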
Here is one solution using the stringr package:
require(stringr)
mylist = c("1100", "0010", "1001", "1111")
str_locate(mylist, "000|001|100")
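str_locate returns a two-column matrix of match start and end positions, one row per input element, with NA rows for non-matches; for the vector above the result should look roughly like this:
##      start end
## [1,]     2   4
## [2,]     1   3
## [3,]     1   3
## [4,]    NA  NA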
With command-line grep, use the -e argument to add additional patterns:
echo '1100' | grep -e '001' -e '110' -e '101'
If you want a logical vector, then you should check out the stri_detect function from the stringi package. In your case the pattern is a regex, so use this one:
stri_detect_regex(x, pattern)
## [1] TRUE TRUE TRUE FALSE
And some benchmarks:
require(stringi)
require(microbenchmark)
test <- stri_paste(stri_rand_strings(100000, 4, "[0-1]"))
head(test)
## [1] "0001" "1111" "1101" "1101" "1110" "0110"
microbenchmark(stri_detect_regex(test, pattern), grepl(pattern, test))
Unit: milliseconds
expr min lq mean median uq max neval
stri_detect_regex(test, pattern) 29.67405 30.30656 31.61175 30.93748 33.14948 35.90658 100
grepl(pattern, test) 36.72723 37.71329 40.08595 40.01104 41.57586 48.63421 100
Sorry for making this an additional answer, but it is too many lines for a comment.
I just wanted to point out that the number of items that can be pasted together via paste(..., collapse = "|") and used as a single matching pattern is limited; see below. Maybe somebody can tell where exactly the limit lies? Admittedly the number might not be realistic, but depending on the task to be performed it should not entirely be excluded from our considerations.
For a really large number of items, a loop would be required to check each item of the pattern; a sketch of that approach follows the example below.
set.seed(0)
samplefun <- function(n, x, collapse) {
  paste(sample(x, n, replace = TRUE), collapse = collapse)
}
words <- sapply(rpois(10000000, 8) + 1, samplefun, letters, '')
text <- sapply(rpois(1000, 5) + 1, samplefun, words, ' ')
#since execution takes a while, I have commented out the following lines
#result <- grepl(paste(words, collapse = "|"), text)
# Error in grepl(pattern, text) :
# invalid regular expression
# 'wljtpgjqtnw|twiv|jphmer|mcemahvlsjxr|grehqfgldkgfu|
# ...
#result <- stringi::stri_detect_regex(text, paste(words, collapse = "|"))
# Error in stringi::stri_detect_regex(text, paste(words, collapse = "|")) :
# Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG)
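A minimal sketch of that loop idea (the helper name detect_any and the chunk size of 1000 are illustrative assumptions, not tuned against the actual engine limit): split the items into chunks small enough for the regex engine, test each chunk, and OR the logical results together.
detect_any <- function(text, items, chunk_size = 1000) {
  hit <- rep(FALSE, length(text))
  for (i in seq(1, length(items), by = chunk_size)) {
    chunk <- items[i:min(i + chunk_size - 1, length(items))]
    # each chunk keeps the pattern well below the engine's size limit
    hit <- hit | grepl(paste(chunk, collapse = "|"), text)
  }
  hit
}
# result <- detect_any(text, words)   # still slow, but avoids the pattern-size error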
You can also use the %like% operator from the data.table package.
library(data.table)
# input
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
# check for pattern
x %like% pattern
[1] TRUE TRUE TRUE FALSE
Related
I have a file:
ABCD.csv
The length before the .csv is not fixed and can vary.
How can I extract the portion before the .csv?
There's a built-in file_path_sans_ext in the standard-install tools package that grabs the file name without the extension.
tools::file_path_sans_ext("ABCD.csv")
## [1] "ABCD"
basename will also remove the path leading to the file. And with this regex, any extension will be removed.
filepath <- "d:/Some Dir/ABCD.csv"
sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(filepath))
# [1] "ABCD"
Or, using file_path_sans_ext as Tyler Rinker suggested:
file_path_sans_ext(basename(filepath))
# [1] "ABCD"
You can use sub or substr:
sub('\\.csv$', '', str1)
#[1] "ABCD"
or
substr(str1, 1, nchar(str1)-4)
#[1] "ABCD"
Using the 'filepath' from @JasonV's post:
sub('\\..*$', '', basename(filepath))
#[1] "ABCD"
Or
library(stringr)
str_extract(filepath, perl('(?<=[/])([^/]+)(?=\\.[^.]+)'))
#[1] "ABCD"
data
str1 <- 'ABCD.csv'
fs::path_ext_remove() "removes the last extension and returns the rest of the path".
fs::path_ext_remove(c("ABCD.csv", "foo.bar.baz.txt", "d:/Some Dir/ABCD.csv"))
# Produces: [1] "ABCD" "foo.bar.baz" "D:/Some Dir/ABCD"
You can try this also:
data <- "ABCD.csv"
gsub(pattern = "\\.csv$", "", data)
#[1] "ABCD"
This is helpful for a list of files as well, say
data <- list.files(pattern = "\\.csv$"); using the same code will remove the extension of all the files in the list.
If you have filenames with multiple (possible) extensions and you want to strip off only the last extension, you can try the following.
Consider the filename foo.bar.baz.txt; this
sub('\\..[^\\.]*$', '', "foo.bar.baz.txt")
will leave you with foo.bar.baz.
Here is an implementation that works for compression and multiple files:
library(tools)  # provides file_ext() and file_path_sans_ext()

remove.file_ext <- function(path, basename = FALSE) {
  out <- c()
  for (p in path) {
    # work on the current element, not the whole vector
    fext <- file_ext(p)
    compressions <- c("gzip", "gz", "bgz", "zip")
    isCompressed <- fext %in% compressions
    if (isCompressed) {
      # e.g. for "counts.csv.gz" strip both ".csv" and ".gz"
      ext <- file_ext(file_path_sans_ext(p, compression = FALSE))
      regex <- paste0("\\.", ext, "\\.", fext, "$")
    } else {
      regex <- paste0("\\.", fext, "$")
    }
    out <- c(out, gsub(pattern = regex, "", p))
  }
  # ifelse() would return only the first element here, so use if/else
  if (basename) basename(out) else out
}
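A quick usage sketch (the file names are made up for illustration); it should give something like:
remove.file_ext(c("d:/Some Dir/ABCD.csv", "data/counts.csv.gz"), basename = TRUE)
## [1] "ABCD"   "counts"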
The above answers are great, but I was interested in which is fastest for dealing with millions of paths at once. It seems that using sub (via this SO question) is the fastest way to get the filename out of the path, and then, comparing the three methods above, tools::file_path_sans_ext is the fastest.
library(fs)
library(stringr)
library(microbenchmark)
files <- paste0("http://some/ppath/to/som/cool/file/", 1:1000, ".flac")
microbenchmark(
  fs::path_ext_remove(sub(".*/", "", files)),
  tools::file_path_sans_ext(sub(".*/", "", files)),
  str_extract(files, '(?<=[/])([^/]+)(?=\\.[^.]+)')
)
Unit: milliseconds
expr min lq mean median uq max neval
fs::path_ext_remove(sub(".*/", "", files)) 10.6273 10.98940 11.323063 11.20500 11.4992 14.5834 100
tools::file_path_sans_ext(sub(".*/", "", files)) 1.3717 1.44260 1.532092 1.48560 1.5588 2.4806 100
str_extract(files, "(?<=[/])([^/]+)(?=\\.[^.]+)") 7.4197 7.62875 7.985206 7.88835 8.2311 9.4107 100
Loading the needed library:
> library(stringr)
Extracting all the matches from the regex:
> str_match("ABCD.csv", "(.*)\\..*$")
[,1] [,2]
[1,] "ABCD.csv" "ABCD"
Returning only the second part of the result, which corresponds to the group matching the file name:
> str_match("ABCD.csv", "(.*)\\..*$")[,2]
[1] "ABCD"
EDIT for @U-10-Forward:
It is basically the same principle as the other answer; I just found this solution more robust.
Regex-wise it means:
() = group
.* = any character except the newline character, any number of times
\\ is escape notation, so \\. means a literal "."
.* = any characters, any number of times again
$ means the match must be at the end of the input string
The logic is then that it returns the group preceding a "." that is followed by a group of characters at the end of the string (which equals the file extension in this case).
I have a list of strings; an example is shown below (the actual list has a much bigger variety of formats).
[1] "AB-123"
[2] "AB-312"
[3] "AB-546"
[4] "ZXC/123456"
Assuming [1] is the correct format, I want to extract the regular expression from [1] and match it against the rest to detect that [4] is inconsistent. Is there a method to do this or is there a better way to achieve the same outcome?
EDIT: I found something close to what I require; does anyone know of any packages that do this?
Given a string, generate a regex that can parse *similar* strings
We may need grepl:
grepl(sub("-.*", "", v1[1]), v1[-1])
data
v1 <- c( "AB-123" , "AB-312" , "AB-546" , "ZXC/123456")
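For these values, sub("-.*", "", v1[1]) reduces the first element to its prefix "AB", so the check should come out as:
grepl(sub("-.*", "", v1[1]), v1[-1])
## [1]  TRUE  TRUE FALSE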
Here's an attempt at a function which classifies each character as a Character, Digit, or Other. It is a bit rough, but I'm sure this can be expanded upon to match exactly what you want:
test <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")
compare_1st <- function(x) {
  x <- toupper(x)
  chars <- list("A", 1, "-")
  repl <- c("[A-Z]", "[0-9]", "[^0-9A-Z]")
  for (i in seq_along(repl)) x <- gsub(repl[i], chars[i], x)
  out <- x[1] == x
  attr(out, "values") <- chartr("A1-", "CDO", x)
  out
}
compare_1st(test)
#[1] TRUE TRUE TRUE FALSE
#attr(,"values")
#[1] "CCODDD" "CCODDD" "CCODDD" "CCCODDDDDD"
I've got the following vector:
words <- c("5lang","kasverschil2","b2b")
I want to remove "5" in "5lang" and "2" in "kasverschil2". But I do NOT want to remove "2" in "b2b".
gsub("^\\d+|\\d+$", "", words)
#[1] "lang" "kasverschil" "b2b"
Another option would be to use stringi
library(stringi)
stri_replace_all_regex(words, "^\\d+|\\d+$", "")
#[1] "lang" "kasverschil" "b2b"
Using a variant of the data set provided by the OP, here are benchmarks for the three main solutions (note that these strings are very short and contrived; results may differ on a larger, real data set):
words <- rep(c("5lang","kasverschil2","b2b"), 100000)
library(stringi)
library(microbenchmark)
GSUB <- function() gsub("^\\d+|\\d+$", "", words)
STRINGI <- function() stri_replace_all_regex(words, "^\\d+|\\d+$", "")
GREGEXPR <- function() {
  gregexpr(pattern = '(^[0-9]+|[0-9]+$)', text = words) -> mm
  sapply(regmatches(words, mm, invert = TRUE), paste, collapse = "")
}
microbenchmark(
  GSUB(),
  STRINGI(),
  GREGEXPR(),
  times = 100L
)
## Unit: milliseconds
## expr min lq median uq max neval
## GSUB() 301.0988 349.9952 396.3647 431.6493 632.7568 100
## STRINGI() 465.9099 513.1570 569.1972 629.4176 738.4414 100
## GREGEXPR() 5073.1960 5706.8160 6194.1070 6742.1552 7647.8904 100
You can use gsub, which uses regular expressions:
gsub("^[0-9]|[0-9]$", "", words)
# [1] "lang" "kasverschil" "b2b"
Explanation:
The pattern ^[0-9] matches a digit at the beginning of the string, while the pattern [0-9]$ matches a digit at the end of the string. By separating these two patterns with | you match either the first or the second pattern. Then you replace the matched pattern with an empty string.
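If a string may begin or end with more than one digit (a small assumption beyond the example data), the + quantifier extends the same idea:
gsub("^[0-9]+|[0-9]+$", "", c("12lang", "kasverschil25", "b2b"))
# [1] "lang"        "kasverschil" "b2b"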
Get the positions where numbers appear at the beginning or end of a word and extract everything else. You need to collapse the results because there may be more than one piece per string:
gregexpr(pattern='(^[0-9]+|[0-9]+$)', text = words) -> mm
sapply(regmatches(words, mm, invert=TRUE), paste, collapse="")
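Run on the example vector, the collapsed pieces should give the same result as the other answers:
## [1] "lang"        "kasverschil" "b2b"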