find names that match either of two patterns [duplicate] - r

This question already has answers here:
grep using a character vector with multiple patterns
(11 answers)
Closed 2 years ago.
Is it possible to find the names in a vector that contain either id OR group OR both in the example below?
I have used grepl() without success.
a = c("c-id" = 2, "g_idgroups" = 3, "z+i" = 4)
grepl(c("id", "group"), names(a)) # return name of elements that contain either `id` OR `group` OR both

You can use:
pattern <- c("id", "group")
grep(paste0(pattern, collapse = '|'), names(a), value = TRUE)
#[1] "c-id" "g_idgroups"
With grepl you get a logical vector instead:
grepl(paste0(pattern, collapse = '|'), names(a))
#[1] TRUE TRUE FALSE
A stringr solution:
stringr::str_subset(names(a), paste0(pattern, collapse = '|'))
#[1] "c-id" "g_idgroups"
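As far as I know, the original grepl(c("id", "group"), names(a)) fails because grepl() only uses the first element of a pattern vector (and warns about it). Collapsing with "|" works as long as the patterns contain no regex metacharacters. If they might, one possible sketch is to match each pattern literally with fixed = TRUE and OR the results together. The names `patterns` and `hits` are made up for this illustration, and the "+i" pattern is just an example of a string that would not be a safe regex.

```r
a <- c("c-id" = 2, "g_idgroups" = 3, "z+i" = 4)
patterns <- c("group", "+i")   # "+i" would not be a safe regex

# One logical vector per pattern, matched literally (fixed = TRUE),
# then combined element-wise with OR.
hits <- Reduce(`|`, lapply(patterns, grepl, x = names(a), fixed = TRUE))
names(a)[hits]
#> [1] "g_idgroups" "z+i"
```

This trades a single regex call for one grepl() call per pattern, which is usually fine for a handful of patterns.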

Using str_detect (from stringr):
> names(a)[stringr::str_detect(names(a), 'id|group')]
[1] "c-id" "g_idgroups"
> names(a)
[1] "c-id" "g_idgroups" "z+i"

Remove period and text after in list [duplicate]

This question already has answers here:
Remove part of string after "."
(6 answers)
gsub() in R is not replacing '.' (dot)
(3 answers)
Closed last year.
I have a list
t <- list('mcd.norm_1','mcc.norm_1', 'mcr.norm_1')
How can I convert the list to remove the period and everything after it, so that the list is just
'mcd' 'mcc' 'mcr'
You may try
library(stringr)
lapply(t, function(x) str_split(x, "\\.", simplify = TRUE)[1])
Another possible solution:
library(tidyverse)
t <- list('mcd.norm_1','mcc.norm_1', 'mcr.norm_1')
t %>%
str_remove("\\..*")
#> [1] "mcd" "mcc" "mcr"
This could be another option:
unlist(sapply(t, \(x) regmatches(x, regexec(".*(?=\\.)", x, perl = TRUE))))
[1] "mcd" "mcc" "mcr"
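A base R variant without any packages is also possible. This is only a sketch of the same idea as str_remove above, using sub() with the same regex; `res` is a name introduced here for illustration.

```r
t <- list("mcd.norm_1", "mcc.norm_1", "mcr.norm_1")

# Delete the first literal period and everything after it.
# lapply() keeps the result as a list, like the input.
res <- lapply(t, function(x) sub("\\..*", "", x))
unlist(res)
#> [1] "mcd" "mcc" "mcr"
```

Drop the unlist() call if you want to keep working with a list.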

How to get the most frequent character within a character string? [duplicate]

This question already has answers here:
Finding the most repeated character in a string in R
(2 answers)
Closed 1 year ago.
Suppose the next character string:
test_string <- "A A B B C C C H I"
Is there any way to extract the most frequent value within test_string?
Something like:
extract_most_frequent_character(test_string)
Output:
#C
We can use scan to read the string as a vector of individual elements by splitting at the spaces, get the frequency counts with table, find the position of the highest count with which.max, and return its name:
extract_most_frequent_character <- function(x) {
names(which.max(table(scan(text = x, what = '', quiet = TRUE))))
}
Testing:
extract_most_frequent_character(test_string)
[1] "C"
Or with strsplit
extract_most_frequent_character <- function(x) {
names(which.max(table(unlist(strsplit(x, "\\s+")))))
}
Here is another base R option (not as elegant as @akrun's answer):
> intToUtf8(names(which.max(table(utf8ToInt(gsub("\\s", "", test_string))))))
[1] "C"
One possibility involving stringr could be:
names(which.max(table(str_extract_all(test_string, "[A-Z]", simplify = TRUE))))
[1] "C"
Or marginally shorter:
names(which.max(table(str_extract_all(test_string, "[A-Z]")[[1]])))
Here is a solution using the stringr package, table and which.max:
library(stringr)
test_string <- str_split(test_string, " ")
test_string <- table(test_string)
names(test_string)[which.max(test_string)]
[1] "C"
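One thing to keep in mind with all of the which.max variants above: which.max returns only the first maximum, so ties are silently dropped. If ties matter, a possible sketch is to compare every count against the maximum. The function name `most_frequent_all` is hypothetical, not something from the answers above.

```r
most_frequent_all <- function(x) {
  # Split on whitespace and count each token.
  counts <- table(strsplit(x, "\\s+")[[1]])
  # Keep every token whose count equals the maximum (handles ties).
  names(counts)[as.vector(counts) == max(counts)]
}

most_frequent_all("A A B B C C C H I")
#> [1] "C"
most_frequent_all("A A B B C")
#> [1] "A" "B"
```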

How to use "and" with grepl? [duplicate]

This question already has answers here:
Is it possible to use an AND operator in grepl()?
(2 answers)
Closed 4 years ago.
This is what I have:
f=5.20
y=168.9850
dat=c("dat.txt","dat_5.20.txt","data_5.20_168.9850.txt")
Filter(function(x) grepl(f, x), dat)
# [1] "dat_5.20.txt" "data_5.20_168.9850.txt"
I need to keep only the element that contains both f and y.
How can I use both f and y in grepl?
The desired result would be:
"data_5.20_168.9850.txt"
One pure regex way of doing this would be to just use two lookahead assertions which independently check for the presence of each of the number strings:
f <- "5\\.20"
y <- "168\\.9850"
dat <- c("dat.txt","dat_5.20.txt","data_5.20_168.9850.txt")
grepl(paste0("(?=.*", f, ")(?=.*", y, ")"), dat, perl=TRUE)
[1] FALSE FALSE TRUE
The pattern used here is (?=.*5\.20)(?=.*168\.9850).
I suppose if you had a long set of search strings and you didn't want to have to type out everything you could do:
dat[Reduce("&", lapply(c(f,y), function(x, dat) grepl(x, dat), dat = dat))]
However, you could probably also get around typing everything out using @TimBiegeleisen's method by doing something like: paste0("(?=.*", c(f,y), ")", collapse = "") and using the result as your search string.
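Spelled out, the paste0 idea could look like the sketch below. It assumes the search terms are already escaped for regex use; `terms` and `pattern` are names chosen here for illustration.

```r
dat <- c("dat.txt", "dat_5.20.txt", "data_5.20_168.9850.txt")
terms <- c("5\\.20", "168\\.9850")   # already escaped for regex use

# One lookahead per term, concatenated into a single pattern.
pattern <- paste0("(?=.*", terms, ")", collapse = "")
pattern
#> [1] "(?=.*5\\.20)(?=.*168\\.9850)"

# Lookaheads need perl = TRUE.
dat[grepl(pattern, dat, perl = TRUE)]
#> [1] "data_5.20_168.9850.txt"
```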
We can do two grep's using any of these alternatives:
grep(y, grep(f, dat, value = TRUE), value = TRUE)
## [1] "data_5.20_168.9850.txt"
dat[grepl(f, dat) & grepl(y, dat)]
## [1] "data_5.20_168.9850.txt"
dat[ intersect(grep(f, dat), grep(y, dat)) ]
## [1] "data_5.20_168.9850.txt"
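A caveat if f and y are kept as numbers, as in the question: R drops trailing zeros when converting them to text (5.20 becomes "5.2"), and the "." is then treated as "any character" by the regex engine. A possible sketch that avoids both issues is to format the numbers explicitly and match them literally. The number of decimals below is an assumption about how the file names were written.

```r
f <- 5.20
y <- 168.9850
dat <- c("dat.txt", "dat_5.20.txt", "data_5.20_168.9850.txt")

# Assumption: file names use 2 decimals for f and 4 decimals for y.
f_chr <- sprintf("%.2f", f)   # "5.20"
y_chr <- sprintf("%.4f", y)   # "168.9850"

# fixed = TRUE matches the strings literally, so "." is just a period.
dat[grepl(f_chr, dat, fixed = TRUE) & grepl(y_chr, dat, fixed = TRUE)]
#> [1] "data_5.20_168.9850.txt"
```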

R's grepl() to find multiple strings exists [duplicate]

This question already has answers here:
R regex to find two words same string, order and distance may vary
(2 answers)
Closed 2 years ago.
grepl("instance|percentage", labelTest$Text)
will return true if any one of instance or percentage is present.
How will I get true only when both the terms are present?
Text <- c("instance", "percentage", "n",
"instance percentage", "percentage instance")
grepl("instance|percentage", Text)
# TRUE TRUE FALSE TRUE TRUE
grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE TRUE
The latter one works by looking for:
('instance')(any character sequence)('percentage')
OR
('percentage')(any character sequence)('instance')
Naturally if you need to find any combination of more than two words, this will get pretty complicated. Then the solution mentioned in the comments would be easier to implement and read.
Another alternative that might be relevant when matching many words is to use positive look-ahead (can be thought of as a 'non-consuming' match). For this you have to activate perl regex.
# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
"character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))
# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
Text2, perl=TRUE)
# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) &
grepl("percentage", Text2) &
grepl("element", Text2) &
grepl("character", Text2)
# they produce identical results
identical(longperl, longstrd)
Furthermore, if you have the patterns stored in a vector you can condense the expressions significantly, giving you
pat <- c("instance", "percentage", "element", "character")
longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
As asked for in the comments, if you want to match on exact words, i.e. not match on substrings, we can specify word boundaries using \\b. E.g:
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE TRUE FALSE
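If every word in a vector should be matched as a whole word, the word boundaries can be pasted in the same way as the look-aheads. This is just a sketch combining the two ideas above; `whole_word` is a name made up for the example, and here both words get boundaries (not only "cent").

```r
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
pat <- c("cent", "element")

# One look-ahead per word, each wrapped in \\b ... \\b.
whole_word <- paste0("(?=.*\\b", pat, "\\b)", collapse = "")
grepl(whole_word, tx, perl = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE
```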
This returns TRUE only if both terms occur in an element of the vector labelTest$Text.
I think this answers the question exactly and is much shorter than the other solutions:
grepl("instance",labelTest$Text) & grepl("percentage",labelTest$Text)
Use intersect and feed it a grep for each word:
# base R subsetting is enough here; no extra package is needed
vector_of_text[
intersect(
grep(vector_of_text , pattern = "pattern1"),
grep(vector_of_text , pattern = "pattern2")
)
]

What is the use of 'sep' in paste command of R? [duplicate]

This question already has answers here:
Concatenate a vector of strings/character
(8 answers)
Closed 6 years ago.
I was working with the paste command in R, when I found that
a <- c("something", "to", "paste")
paste(a, sep="_")
produces the output
# [1] "something" "to" "paste"
which is the same as when I print a:
# [1] "something" "to" "paste"
So what effect does the sep have on the paste command in R?
sep applies when you paste two or more vectors together: it is the separator placed between the corresponding elements of those vectors. If you were looking to get "something_to_paste", then you would be looking for the collapse argument.
Try the following to get a sense of what the sep argument does:
paste(a, 1:3, sep = "_")
# [1] "something_1" "to_2" "paste_3"
and compare it to collapse:
paste(a, collapse = "_")
# [1] "something_to_paste"
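The two arguments can also be combined: sep joins the corresponding elements first, then collapse joins the resulting strings into one. A small sketch (the " + " separator is arbitrary):

```r
a <- c("something", "to", "paste")

# sep works across the arguments, collapse across the resulting elements.
both <- paste(a, 1:3, sep = "_", collapse = " + ")
both
#> [1] "something_1 + to_2 + paste_3"

# With several length-one arguments, sep alone is enough.
paste("x", "y", "z", sep = "_")
#> [1] "x_y_z"
```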
