subset strings without a pattern stringr - r

I want to extract elements of a character vector which do not match a given pattern. See the example:
x<-c("age_mean","n_aitd","n_sle","age_sd","n_poly","n_sero","child_age")
x_age<-str_subset(x,"age")
x_notage<-setdiff(x,x_age)
In this example I want to extract those strings which do not match the pattern "age". How can I achieve this in a single call to str_subset? What is the appropriate syntax for a "not age" pattern? As you can see, I am not an expert in regex. Thanks for any comments.

In this case there seems to be no reason to use stringr (except perhaps efficiency). You may simply use grep:
grep("age", x, invert = TRUE, value = TRUE)
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
If, however, you want to stick with stringr, note that (from ?str_subset)
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE).
So,
x[!str_detect(x, "age")]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
or also
x[!grepl("age", x)]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
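For the "single call" the question asks about, two further options may be worth knowing. The sketch below assumes a stringr version of 1.4.0 or later for the negate argument, and it guards that call so the base R part runs even without stringr installed. The lookahead pattern ^(?!.*age) is one way to spell "not age" as a regex; it is an illustration, not the only possible pattern.

```r
x <- c("age_mean", "n_aitd", "n_sle", "age_sd", "n_poly", "n_sero", "child_age")

# Base R: a negative lookahead anchored at the start rejects any string
# that contains "age" anywhere. Lookaheads need perl = TRUE in base R.
x_notage <- grep("^(?!.*age)", x, perl = TRUE, value = TRUE)
x_notage

# stringr: negate = TRUE returns the non-matching elements in one call
# (assumption: stringr >= 1.4.0, where this argument is available).
if (requireNamespace("stringr", quietly = TRUE)) {
  stringr::str_subset(x, "age", negate = TRUE)
}
```

Both calls should return the same four elements as the setdiff approach in the question.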

Related

Match pattern so long as it doesn't contain a specific string

Let's say I have the following strings:
quiz.1.player.chat_results
and
partner_quiz.1.player.chat_results
I have hundreds of strings like this where the only difference is that one is prefixed with "partner" and the other is not. I'm trying to match one but not the other.
The specific pattern I'd like to match looks like so:
index <- grep('^(quiz.)[1-5]{1}.player.chat_results', names(data))
But this will match both strings. I'm guessing I have to use some negative lookahead like so:
^((?!partner).)
But I'm not sure where to use it.
I'll answer your title question, as it will be the most useful to other people finding this question.
How to match strings that do not contain a given pattern? Easy, match the pattern and invert it.
index <- grep('^partner', names(data), invert = TRUE)
Another approach: use str_detect from stringr
> library(stringr)
> str_detect(string, "partner", negate=TRUE)
[1] TRUE FALSE
You can even use one grepl and negate the result
> !grepl("partner", string)
[1] TRUE FALSE
Just for fun: you can split each string on \\. or _, compare the resulting pieces to partner, and finally invert the result
> sapply(strsplit(string, "\\.|_"), function(x) !"partner" %in% x)
[1] TRUE FALSE
We can use two grepl to avoid any confusion
grepl('quiz', names(data)) & !grepl('partner', names(data))
#[1] TRUE FALSE
For someone who is a bit regex-blind like myself, sub can help,
sub('_.*', '', x) == 'partner'
#[1] TRUE FALSE
If you want to match the pattern including the digits, you could use a word boundary \b followed by a negative lookahead (?!partner) to assert that what is directly to the right is not partner.
Note that the dot must be escaped to match it literally, and that {1} can be omitted. If you do not need the value of the capturing group around quiz, you can omit the group as well.
To match the rest of the string, you can use \S* to match zero or more non-whitespace characters.
\b(?!partner)quiz\.[1-5]\.player\S*
For example
regmatches(txt, regexpr("\\b(?!partner)quiz\\.[1-5]\\.player\\S*", txt, perl = TRUE))
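To make the lookahead placement concrete, here is a small self-contained sketch. The vector nms is a hypothetical stand-in for names(data), built from the two strings in the question. The first pattern puts the lookahead directly after the ^ anchor, and the second extracts the matching text with a word boundary.

```r
# Hypothetical names modelled on the two strings in the question
nms <- c("quiz.1.player.chat_results", "partner_quiz.1.player.chat_results")

# Lookahead placed right after the anchor: reject names starting with "partner"
idx <- grep("^(?!partner).*quiz\\.[1-5]\\.player", nms, perl = TRUE)
idx

# Extract the matching text; \\b fails between "_" and "q" because both
# are word characters, so the partner_ name yields no match
hits <- regmatches(nms, regexpr("\\bquiz\\.[1-5]\\.player\\S*", nms, perl = TRUE))
hits
```

Only the first name is selected in both cases. The lookahead needs perl = TRUE, since the default regex engine in base R does not support it.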

stringr Package replace characters

I am trying to replace each character in a string with following rules using stringr package:
replace_characters <- function(x){str_replace_all(x,c("A"="N",'B'='O','C'='P','D'='Q','E'='R','F'='S','G'='T','H'='U','I'='V','J'='W','K'='X','L'='Y','M'='Z',
'N'='A','O'='B','P'='C','Q'='D','R'='E','S'='F','T'='G','U'='H','V'='I','W'='J','X'='K','Y'='L','Z'='M','0'='5','1'='6','2'='7','3'='8','4'='9','5'='0','6'='1','7'='2','8'='3','9'='4'))}
and then I tried the function with a random string:
replace_characters("HSNKSL584")
and I got:
"HFAKFL034"
As you can see, some of the characters have been replaced as expected, but some remain unchanged.
Can anyone explain the reason?
Thanks!
Behind the scenes, stringr::str_replace_all calls stringi's stri_replace_all_* functions. If you use a named vector to describe multiple replacement patterns (which is the case here), the parameters passed to stri_replace_all_* include vectorize_all = FALSE.
From stri_replace_all_*'s help file:
However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and - independently - over pattern and replacement. In other
words, this is equivalent to something like for (i in 1:npatterns) str
<- stri_replace_all(str, pattern[i], replacement[i]...
As raymkchow & Sotos have noted in the comments, when the replacement patterns are applied one by one in sequence, some characters are replaced more than once, which effectively reverses a replacement made in an earlier cycle.
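The sequential behaviour can be reproduced in base R without stringr. The sketch below is an emulation with a plain gsub loop, not the stringi implementation itself; it applies the 36 replacements in the same order as the named vector in the question and records the cycles in which the string changes.

```r
old <- c(LETTERS, 0:9)                                  # A..Z, 0..9
new <- c(LETTERS[14:26], LETTERS[1:13], 5:9, 0:4)       # N..Z, A..M, 5..9, 0..4

s <- "HSNKSL584"
trace <- character(0)
for (i in seq_along(old)) {
  s_next <- gsub(old[i], new[i], s, fixed = TRUE)
  # Record only the cycles that actually changed the string
  if (s_next != s) trace <- c(trace, paste0(old[i], "->", new[i], ": ", s_next))
  s <- s_next
}
trace
s
```

The trace shows, for example, H becoming U in the H cycle and then turning back into H in the later U cycle, which is why the final value is "HFAKFL034" rather than "UFAXFY039".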
We can do this with chartr from base R
chartr("HSNKL584", "UFAXY039", "HSNKSL584")
#[1] "UFAXFY039"
This can be made into a function
replace_char_fun <- function(str1) {
  old <- paste(c(LETTERS, 0:9), collapse = "")
  new <- paste(c(LETTERS[14:26], LETTERS[1:13], 5:9, 0:4), collapse = "")
  chartr(old, new, str1)
}
replace_char_fun("HSNKSL584")
#[1] "UFAXFY039"

Match & Replace String, utilising the original string in the replacement, in R

I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append the following string to every letter in every element of char_vec:
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
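Backreferences are not limited to a single group. As a further illustration (the input vector x here is made up for the example), the sketch below uses two capturing groups to swap the parts around an underscore, and one group to wrap each run of letters in brackets.

```r
x <- c("age_mean", "age_sd", "n_sle")

# \\1 is the part before the underscore, \\2 the part after it
swapped <- sub("^([a-z]+)_([a-z]+)$", "\\2_\\1", x)
swapped

# One group, reused inside a longer replacement string
wrapped <- gsub("([[:alpha:]]+)", "<\\1>", "A B 123?")
wrapped
```

In an R string literal the backslash itself has to be escaped, which is why the replacement is written as "\\1" even though the regex engine sees \1.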
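Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)      # indices of elements containing a letter
gregexpr('[A-Za-z]', char_vars)  # positions of every letter in each element
matches <- regmatches(char_vars, gregexpr('[A-Za-z]', char_vars))
# Append the suffix after each matched letter, one match at a time.
# Caveat: sub() replaces the first occurrence, so this can misfire when a
# matched letter also appears earlier in the string (e.g. inside "_append").
for (i in seq_along(matches)) {
  for (found in matches[[i]]) {
    char_vars[i] <- sub(pattern = found,
                        replacement = paste0(found, "_append"),
                        x = char_vars[i])
  }
}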

R: How to remove a string containing a specific character pattern?

I'm trying to remove strings that contain a specific character pattern. My data looks something like this:
places <- c("copenhagen", "copenhagens", "Berlin", "Hamburg")
I would like to remove all elements that contain "copenhagen", i.e. "copenhagen" and "copenhagens".
But I was only able to come up with the following code:
library(stringr)
replacement.vector <- c("copenhagen", "copenhagens")
for(i in 1:length(replacement.vector)){
  places = lapply(places, FUN=function(x)
    gsub(paste0("\\b",replacement.vector[i],"\\b"), "", x))
}
I'm looking for a function that enables me to remove all elements that contain "copenhagen" without having to specify whether or not the element also includes other letters.
Best,
Dose
Based on the OP's code, it seems like we need to subset the 'places'. In that case, it may be better to use grep with invert= TRUE argument
grep("copenhagen", places, invert=TRUE, value = TRUE)
#[1] "Berlin" "Hamburg"
or use grepl and negate (!)
places[!grepl("copenhagen", places)]
#[1] "Berlin" "Hamburg"
