I'm trying to remove strings that contain a specific character pattern. My data looks something like this:
places <- c("copenhagen", "copenhagens", "Berlin", "Hamburg")
I would like to remove all elements that contain "copenhagen", i.e. "copenhagen" and "copenhagens".
But I was only able to come up with the following code:
library(stringr)
replacement.vector <- c("copenhagen", "copenhagens")
for(i in 1:length(replacement.vector)){
  places = lapply(places, FUN=function(x)
    gsub(paste0("\\b",replacement.vector[i],"\\b"), "", x))
}
I'm looking for a function that lets me remove all elements that contain "copenhagen" without having to specify whether or not the element also includes other letters.
Best,
Dose
Based on the OP's code, it seems we need to subset 'places'. In that case, it may be better to use grep with the invert = TRUE argument
grep("copenhagen", places, invert=TRUE, value = TRUE)
#[1] "Berlin" "Hamburg"
or use grepl and negate (!)
places[!grepl("copenhagen", places)]
#[1] "Berlin" "Hamburg"
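Both calls treat "copenhagen" as a case-sensitive regular expression. The following is a minimal sketch, assuming the data might also contain capitalised variants (the "Copenhagens" entry is made up for illustration), of how ignore.case and fixed change which elements are dropped:

```r
# Illustrative input: "Copenhagens" is a hypothetical capitalised variant
places <- c("copenhagen", "Copenhagens", "Berlin", "Hamburg")

# Case-insensitive regex match: drops both spellings
no_case <- places[!grepl("copenhagen", places, ignore.case = TRUE)]
no_case
#[1] "Berlin"  "Hamburg"

# Literal, case-sensitive match: keeps the capitalised variant
literal <- places[!grepl("copenhagen", places, fixed = TRUE)]
literal
#[1] "Copenhagens" "Berlin"      "Hamburg"
```

fixed = TRUE is mostly useful when the search string contains regex metacharacters such as "." or "(".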
I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append to any letter within any element of char_vec the string
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
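As a small variation (a sketch with made-up input, not part of the original question), adding a + quantifier inside the group captures a whole run of letters, so the suffix is appended once per word rather than once per letter:

```r
# Hypothetical input with multi-letter words
words <- c("AB", "AB CD", "123?")
suffix <- "_APPEND"

# One letter per match: the suffix follows every letter
per_letter <- gsub("([[:alpha:]])", paste0("\\1", suffix), words)
per_letter
## => [1] "A_APPENDB_APPEND" "A_APPENDB_APPEND C_APPENDD_APPEND" "123?"

# One run of letters per match: the suffix follows every word
per_word <- gsub("([[:alpha:]]+)", paste0("\\1", suffix), words)
per_word
## => [1] "AB_APPEND" "AB_APPEND CD_APPEND" "123?"
```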
Definitely not as slick as @Wiktor Stribiżew's answer, but here is what I developed as another method.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)
gregexpr('[A-Za-z]', char_vars)
matches = regmatches(char_vars,gregexpr('[A-Za-z]', char_vars))
for(i in 1:length(matches)) {
for(found in matches[[i]]){
char_vars[i] = sub(pattern = found,
replacement = paste(found, "_append", sep=""),
x=char_vars[i])
}
}
I'm trying to extract certain records from a dataframe with grepl.
This is based on the comparison between two columns, Results and Names. The variable is built like "WordNumber", but for the same word I have multiple numbers (more than 30), so when I use grepl to get, for instance, Word1, I also get results that I would like to avoid, like Word12.
Any ideas on how to fix this?
Names <- data.frame(name = c("Word1"))
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))
This doesn't work. If I use fixed = TRUE then it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?
In addition to @Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" matches a word boundary, i.e. the position between a word character and a non-word character (or the start or end of the string)
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" matches the empty string at the beginning of a word, and "\>" matches it at the end of a word
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')
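When several names have to be collapsed into one pattern with "|", one option is to wrap the alternation in a group and anchor the group as a whole. A minimal sketch, assuming a second name "Word3" and extra Results values that are not in the original question:

```r
# Hypothetical names and results for illustration
Names <- c("Word1", "Word3")
Results <- c("Word1", "Word11", "Word12", "Word3", "Word31")

# Group the alternatives so that ^ and $ apply to every one of them
pattern <- paste0("^(", paste(Names, collapse = "|"), ")$")
pattern
# [1] "^(Word1|Word3)$"

matched <- Results[grepl(pattern, Results)]
matched
# [1] "Word1" "Word3"
```

Without the parentheses, "^Word1|Word3$" would anchor only the first alternative at the start and only the last one at the end, so "Word11" would match again.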
I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]
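The reason for switching to %in% with multiple names is that == compares element by element and recycles the shorter vector. A short sketch, assuming a second name "Word12" purely for illustration:

```r
Results <- c("Word1", "Word11", "Word12", "Word15")
Names <- c("Word1", "Word12")  # hypothetical second name

# == recycles Names to Word1, Word12, Word1, Word12 and compares by position
by_position <- Results == Names
by_position
# [1]  TRUE FALSE FALSE FALSE

# %in% asks, for each result, whether it appears anywhere in Names
by_membership <- Results %in% Names
by_membership
# [1]  TRUE FALSE  TRUE FALSE
```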
I want to extract elements of a character vector which do not match a given pattern. See the example:
library(stringr)
x<-c("age_mean","n_aitd","n_sle","age_sd","n_poly","n_sero","child_age")
x_age<-str_subset(x,"age")
x_notage<-setdiff(x,x_age)
In this example I want to extract the strings that do not match the pattern "age". How can I achieve this in a single call to str_subset? What is the appropriate syntax for a "not age" pattern? As you can see, I am not very experienced with regex. Thanks for any comments.
In this case there seems to be no reason to use stringr (efficiency perhaps). You may simply use grep:
grep("age", x, invert = TRUE, value = TRUE)
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
If, however, you want to stick with stringr, note that (from ?str_subset)
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE).
So,
x[!str_detect(x, "age")]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
or also
x[!grepl("age", x)]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
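For completeness, newer stringr releases also accept a negate argument in str_subset, which gives the single call the question asks for. This is a sketch that assumes stringr 1.4.0 or later is installed; with older versions the argument is not available and the approaches above still apply:

```r
library(stringr)  # assumes stringr >= 1.4.0 for the negate argument

x <- c("age_mean", "n_aitd", "n_sle", "age_sd", "n_poly", "n_sero", "child_age")

# negate = TRUE returns the elements that do NOT match the pattern
x_notage <- str_subset(x, "age", negate = TRUE)
x_notage
# [1] "n_aitd" "n_sle"  "n_poly" "n_sero"
```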
Search-and-replace an element in a data frame given a list of replacements.
Code:
library(gsubfn)
testing123tmp <- data.frame(x=c("it's", "not", "working"))
testing123tmp$x <- as.character(testing123tmp$x)
tmp <- list("it's" = "hey", "working"="dead")
apply(testing123tmp,2,function(x) gsubfn('.', tmp, x))
Expected Output:
x
[1,] hey
[2,] not
[3,] dead
My current output:
x
[1,] "it's"
[2,] "not"
[3,] "working"
Been looking around for possible solution in chartr and gsub, but would like simplicity (short coding) given multiple gsub is required for such operation. Also my variable tmp can be scaled to many-pair replacement such that:
tmp <- list("it's" = "hey",
"working"="dead",
"other" = "other1",
.. = .. ,
.. = .. ,
.. = .. )
Edit/Update #1:
I would also like a solution that uses gsubfn, as above, and keeps the result in a data frame.
The issues are these:
The dot only matches one character so it will never match an entire string unless that entire string has one character and therefore no name in tmp will ever be matched. Use ".*" to match the entire string. If you wanted to match words, i.e. there are possibly several words separated by whitespace in each component of x so that for example one component of x might be "it's not" and we still wanted to match it's then use "\\S+". There are other variations one could imagine as well and this gives a framework that encompasses many of them.
the third argument to gsubfn can already be a vector and gsubfn will iterate over it so it is not necessary to use apply. (It will still work with apply but it is unnecessary.)
to keep everything in a data frame one easy way is to use transform as shown below (or alternately use transform2, also in the gsubfn package). The x will automatically refer to the x column in the testing123tmp data frame and transform will produce a new data frame not overwriting the original. If you want to keep these separate assign the result of transform to a new name or if you want to overwrite testing123tmp then assign it back to testing123tmp.
we can use stringsAsFactors = FALSE to avoid generating factor columns, which removes the need for the as.character conversion.
testing123tmp <- data.frame(x=c("it's", "not", "working"), stringsAsFactors = FALSE)
Thus we can reduce the code to:
transform(testing123tmp, y = gsubfn(".*", tmp, x))
giving the following data.frame:
x y
1 it's hey
2 not not
3 working dead
If we wanted to overwrite the x column rather than keep separate input and output columns we could have used x = ... in the transform statement instead of y = ... .
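The "\\S+" variant mentioned in the first point can be sketched as follows. The phrases vector is made up for illustration, and the example assumes the gsubfn package is installed; each whitespace-delimited word is looked up in the names of tmp and left unchanged when it is not found:

```r
library(gsubfn)

tmp <- list("it's" = "hey", "working" = "dead")

# Hypothetical input where each element holds several words
phrases <- c("it's not working", "still working", "fine")

# "\\S+" matches each run of non-whitespace characters, i.e. each word
replaced <- gsubfn("\\S+", tmp, phrases)
replaced
# [1] "hey not dead" "still dead"   "fine"
```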
You may write
gsubfn(".*", tmp, testing123tmp$x)
# [1] "hey" "not" "dead"
and then
testing123tmp$x <- gsubfn(".*", tmp, testing123tmp$x)
As for your approach, there was no need for apply, as gsubfn is vectorized over that parameter, and the problem was that . matches only one character, while it's and working are of varying length.
However, if you are replacing one word with another word, then there is no need for regex. For instance,
idx <- testing123tmp$x %in% names(tmp)
testing123tmp$x[idx] <- unlist(tmp)[testing123tmp$x[idx]]
should work faster. If the task is more involved, then I guess
library(stringr)
str_replace_all(testing123tmp$x, unlist(tmp))
# [1] "hey" "not" "dead"
should be more robust than gsubfn as you don't need to deal with patterns like .*.
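One difference between the lookup and the pattern-based approaches is worth keeping in mind: the lookup replaces whole elements only, while pattern replacement also rewrites substrings. A base R sketch of that contrast, using a made-up extra element "networking" (the loop with gsub stands in for any substring-based replacement and is only an illustration):

```r
tmp <- list("it's" = "hey", "working" = "dead")
lookup <- unlist(tmp)

# "networking" is a hypothetical element that merely contains "working"
x <- c("it's", "not", "working", "networking")

# Whole-element lookup: only exact matches are replaced
exact <- x
idx <- exact %in% names(lookup)
exact[idx] <- unname(lookup[exact[idx]])
exact
# [1] "hey"        "not"        "dead"       "networking"

# Substring replacement: "working" inside "networking" is rewritten too
partial <- x
for (p in names(lookup)) {
  partial <- gsub(p, lookup[[p]], partial, fixed = TRUE)
}
partial
# [1] "hey"     "not"     "dead"    "netdead"
```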
If I have a vector of strings:
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012")
and want to find the entries with '2011' in the string I can use
ifiles <- dd[grep("2011",dd)]
How do I search for entries with a combination of strings included, without using a loop?
For example, I would like to find the entries with both '2011' and 'sprd' in the string, which in this case will only return
sflxgrbfg_sprd_2011
How can this be done? I could define a variable
toMatch <- c('2011','sprd')
and then loop through the entries but I was hoping there was a better solution?
Note: To make this useful for different strings, is it also possible to determine which entries have these strings without them being in the order shown? For example, 'sflxlgrbfg_2011_sprd'
If you want to find more than one pattern, try indexing with a logical value rather than the number. That way you can create an "and" condition, where only the string with both patterns will be extracted.
ifiles <- dd[grepl("2011",dd) & grepl("sprd_",dd)]
Try
grep('2011_sprd|sprd_2011', dd, value=TRUE)
#[1] "sflxgrbfg_sprd_2011" "sflxlgrbfg_2011_sprd"
Or using an example with more patterns
grep('(?<=sprd_).*(?=2011)|(?<=2011_).*(?=sprd)', dd1,
value=TRUE, perl=TRUE)
#[1] "sflxgrbfg_sprd_2011" "sflxlgrbfg_2011_sprd"
#[3] "sfxl_2011_14334_sprd" "sprd_124334xsff_2011_1423"
data
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012",
"sflxlgrbfg_2011_sprd")
dd1 <- c(dd, "sfxl_2011_14334_sprd", "sprd_124334xsff_2011_1423")
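An order-independent alternative to listing every ordering is to chain lookaheads, one per required substring. This is a sketch rather than a drop-in replacement for the patterns above; it uses a subset of the example data and checks only that both substrings occur somewhere in the string:

```r
files <- c("sflxgrbfg_sprd_2011", "sflxgrbfg_sprd_2012",
           "sflxlgrbfg_2011_sprd", "sfxl_2011_14334_sprd")

# Each (?=.*...) lookahead asserts that the substring occurs somewhere
# after the start of the string, without consuming any characters
both <- grep("^(?=.*2011)(?=.*sprd)", files, value = TRUE, perl = TRUE)
both
#[1] "sflxgrbfg_sprd_2011"  "sflxlgrbfg_2011_sprd" "sfxl_2011_14334_sprd"
```

Because this is plain substring matching, an entry such as "sflxgrbfg_sprd2_2011" would also qualify; use "sprd_" or a stricter pattern if that is not wanted.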
If you want a scalable solution, you can use lapply, Reduce and intersect to:
For each expression in toMatch, find the indices of all matches in dd.
Keep only those indices that are found for all expressions in toMatch.
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012")
dd <- c(dd, "sflxgrbfh_sprd_2011")
toMatch <- c('bfg', '2011','sprd')
dd[Reduce(intersect, lapply(toMatch, grep, dd))]
#> [1] "sflxgrbfg_sprd_2011" "sflxgrbfg_sprd2_2011"
Created on 2018-03-07 by the reprex package (v0.2.0).
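The same idea can also be written with logical vectors instead of indices, which ties it back to the grepl answer above. A minimal sketch using the same data; fixed = TRUE is an added assumption that the entries in toMatch are literal strings rather than regular expressions:

```r
dd <- c("sflxgrbfg_sprd_2011", "sflxgrbfg_sprd2_2011",
        "sflxgrbfg_sprd_2012", "sflxgrbfh_sprd_2011")
toMatch <- c("bfg", "2011", "sprd")

# One logical vector per pattern, combined element-wise with &
keep <- Reduce(`&`, lapply(toMatch, grepl, x = dd, fixed = TRUE))
keep
#> [1]  TRUE  TRUE FALSE FALSE

dd[keep]
#> [1] "sflxgrbfg_sprd_2011"  "sflxgrbfg_sprd2_2011"
```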