Let's say I have a string in R:
str <- "abc abc cde cde"
and I use regmatches and gregexpr to find how many "b"s there are in my string:
regmatches(str, gregexpr("b",str))
but I want an output of everything that contains the letter b.
So an output like: "abc", "abc".
Thank you!
tmp <- "abc abc cde cde"
Split the string up into separate elements, grep for "b", return elements:
grep("b", unlist(strsplit(tmp, split = " ")), value = TRUE)
Look for non-space before and after, something like:
regmatches(str, gregexpr("\\S*b\\S*", str))
# [[1]]
# [1] "abc" "abc"
Special regex characters are documented in ?regex. For this case, \\s matches "any space-like character", and \\S is its negation, so any non-space-like character. You could be more specific, such as \\w ('word' character, same as [[:alnum:]_]). The * means zero-or-more, and + means one-or-more (forcing at least one character to match).
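As a rough illustration of how \\S and \\w differ, here is a small sketch on a made-up input (the string txt and its hyphenated token are assumptions for the example, not part of the question):

```r
# Hypothetical input: the second token contains hyphens, which \\w does not match.
txt <- "abc a-b-c cde"

# \\S* spans any run of non-space characters around the "b".
by_nonspace <- regmatches(txt, gregexpr("\\S*b\\S*", txt))[[1]]

# \\w* stops at the hyphens, so only the bare "b" is returned for "a-b-c".
by_wordchar <- regmatches(txt, gregexpr("\\w*b\\w*", txt))[[1]]

by_nonspace
# [1] "abc"   "a-b-c"
by_wordchar
# [1] "abc" "b"
```

Which one is appropriate depends on what counts as a "word" in your data.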
I suppose you mean you want to find words that contain b. One regex that does this is
\w*b\w*
\w* matches 0 or more word characters, which are a-z, A-Z, 0-9, and the underscore character.
Here is a base R option using strsplit and grepl:
str <- "abc abc cde cde"
words <- strsplit(str, "\\s+")[[1]]
idx <- sapply(words, function(x) { grepl("b", x)})
matches <- words[idx]
matches
[1] "abc" "abc"
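Since grepl is vectorized over its input, the sapply wrapper is optional. A minimal sketch of the same idea (fixed = TRUE is my addition; it treats "b" as a literal string rather than a regex):

```r
str <- "abc abc cde cde"

# Split on runs of whitespace, then keep the words containing a literal "b".
words <- strsplit(str, "\\s+")[[1]]
matches <- words[grepl("b", words, fixed = TRUE)]

matches
# [1] "abc" "abc"
```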
I'm trying to replace part of a string that is matched, as in the following example:
str1 <- "abc sdak+ 123+"
I would like to replace every + that comes after 3 digits, but not a + that comes after letters. I tried the following, but it replaces the whole matched string, when I only want to replace the + with a -:
gsub("[0-9]{3}\\+", "-", str1)
The desired outcome should be:
"abc sdak+ 123-"
We could capture the 3 digits as a group ((...)) and match the + after it, then replace with the backreference (\\1) of the captured group followed by a -. To make sure there are no digits before the 3 digits, use either a word boundary (\\b) or a space (\\s):
gsub("\\b(\\d{3})\\+", "\\1-", str1)
-output
[1] "abc sdak+ 123-"
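To make the role of the capture group concrete, here is a small side-by-side sketch. The variable names lost, kept and four_digits are mine, and the "1234+" input is a made-up case; perl = TRUE on the last call is my choice, not something the answer requires.

```r
str1 <- "abc sdak+ 123+"

# Without a capture group, the digits are part of the match and are replaced too.
lost <- gsub("[0-9]{3}\\+", "-", str1)

# With a capture group, \\1 puts the three digits back before the "-".
kept <- gsub("\\b(\\d{3})\\+", "\\1-", str1)

# The word boundary stops a four-digit number from matching on its last three digits.
four_digits <- gsub("\\b(\\d{3})\\+", "\\1-", "1234+", perl = TRUE)

lost
# [1] "abc sdak+ -"
kept
# [1] "abc sdak+ 123-"
four_digits
# [1] "1234+"
```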
You can also use a look-behind, i.e. check whether the + symbol is preceded by a space and 3 digits, and if so, replace it:
str1 <- "abc sdak+ 123+"
gsub("(?<= [0-9]{3})\\+", "-", str1, perl = TRUE)
[1] "abc sdak+ 123-"
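Note that the look-behind above requires a literal space before the digits, so a string that starts with the digits is left alone. A possible variant, assuming you want start-of-string to count as well, is to use a word boundary inside the look-behind (the inputs below are made up for illustration):

```r
x <- c("abc sdak+ 123+", "1234+", "123+")

# Look-behind with a leading space: "123+" at the start of a string is not matched.
with_space <- gsub("(?<= [0-9]{3})\\+", "-", x, perl = TRUE)

# Look-behind with a word boundary: matches after exactly three digits,
# whether they follow a space or the start of the string.
with_boundary <- gsub("(?<=\\b\\d{3})\\+", "-", x, perl = TRUE)

with_space
# [1] "abc sdak+ 123-" "1234+"          "123+"
with_boundary
# [1] "abc sdak+ 123-" "1234+"          "123-"
```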
I have the following sample dataset:
XYZ 185g
ABC 60G
Gha 20g
How do I remove the strings "185g", "60G", "20g" without accidentally removing the letters g and G in the main words?
I tried the code below, but it replaces the letters in the main words as well.
a <- str_replace_all(a$words,"[0-9]"," ")
a <- str_replace_all(a$words,"[gG]"," ")
You need to combine them into something like
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]$", "")
The \s*\d+[gG]$ regex matches
\s* - zero or more whitespaces
\d+ - one or more digits
[gG] - g or G
$ - end of string.
If you can have these strings inside a string, not just at the end, you may use
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]\\b", "")
where $ is replaced with a \b, a word boundary.
To ignore case,
a$words <- str_replace_all(a$words, regex("\\s*\\d+g\\b", ignore_case=TRUE), "")
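The difference between the $ and \b versions only shows up when the weight is not the last thing in the string. A rough sketch in base R (gsub with perl = TRUE stands in for str_replace_all here, and the "ABC 60G extra" input is invented for the example):

```r
x <- c("XYZ 185g", "ABC 60G extra", "Gha 20g")

# Anchored with $: only a weight at the very end of the string is removed.
end_only <- gsub("\\s*\\d+[gG]$", "", x, perl = TRUE)

# Anchored with \\b: a weight followed by more text is removed as well.
anywhere <- gsub("\\s*\\d+[gG]\\b", "", x, perl = TRUE)

end_only
# [1] "XYZ"           "ABC 60G extra" "Gha"
anywhere
# [1] "XYZ"       "ABC extra" "Gha"
```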
You can try
> gsub("\\s\\d+g$", "", c("XYZ 185g", "ABC 60G", "Gha 20g"), ignore.case = TRUE)
[1] "XYZ" "ABC" "Gha"
You can also use the following solution:
vec <- c("XYZ 185g", "ABC 60G", "Gha 20g")
gsub("[A-Za-z]+(*SKIP)(*FAIL)|[ 0-9Gg]+", "", vec, perl = TRUE)
[1] "XYZ" "ABC" "Gha"
I have a pattern that I want to match and replace with an X. However, I only want the pattern to be replaced if the preceding character is either an A or a B, or if the pattern is not preceded by any character (beginning of string).
I know how to replace patterns using the str_replace_all function but I don't know how I can add this additional condition. I use the following code:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- "0000"
replacement <- str_replace_all(string, pattern, paste0("XXXX"))
Result:
[1] "XXXXAXXXXBXXXXCXXXXDXXXXEXXXXAXXXX"
Desired result:
Replacement only when the preceding character is A, B or no character:
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
You may use
gsub("(^|[AB])0000", "\\1XXXX", string)
Details
(^|[AB]) - Capturing group 1 (\1): start of string (^) or (|) A or B ([AB])
0000 - four zeros.
R demo:
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub("(^|[AB])0000", "\\1XXXX", string)
## -> [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
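If you would rather not re-insert the captured character, one alternative (a different technique from the capture group above, sketched here as a suggestion) is a negative look-behind that rejects any preceding character other than A or B. At the start of the string there is no preceding character at all, so the look-behind passes there too:

```r
string <- "0000A0000B0000C0000D0000E0000A0000"

# (?<![^AB]) reads: "not preceded by a character that is something other than A or B".
result <- gsub("(?<![^AB])0000", "XXXX", string, perl = TRUE)

result
# [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
```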
You could also try the following, which uses a positive lookahead:
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub(x = string, pattern = "(^|A|B)(?=0000)((?i)0000?)",
replacement = "\\1xxxx", perl=TRUE)
Output will be as follows.
[1] "xxxxAxxxxBxxxxC0000D0000E0000Axxxx"
Thanks to Wiktor Stribiżew for the answer! It also works with the stringr package:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replace <- str_replace_all(string, paste0("(^|[AB])",pattern), "\\1XXXX")
replace
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
This is my first post on Stack Overflow and I'll try to explain my problem as succinctly as possible.
The problem is pretty simple. I'm trying to identify strings containing alphanumeric characters, and alphanumeric characters with symbols, and remove them. I looked at previous questions on Stack Overflow and found a solution that looks good.
https://stackoverflow.com/a/21456918/7467476
I tried the provided regex (slightly modified) in Notepad++ on some sample data just to see if it's working (and yes, it works). Then, I proceeded to use the same regex in R and use gsub to replace the string with "" (code given below).
replace_alnumsym <- function(x) {
return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[_-])[A-Za-z0-9_-]{8,}", "", x, perl = T))
}
replace_alnum <- function(x) {
return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{8,}", "", x, perl = T))
}
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
output1 <- sapply(sample, replace_alnum)
output2 <- sapply(sample, replace_alnumsym)
The code runs fine but the output still contains the strings. It hasn't been removed. I'm not getting any errors when I run the code (output below). The output format is also strange. Each element is printed twice (once without and once within quotes).
> output1
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh WQWEQtWe_232
"abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
> output2
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh WQWEQtWe_232
"abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
The desired result would be:
> output1
abc def ghi abcd efgh WQWEQtWe_232
> output2
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh
I think I'm probably overlooking something very obvious.
Appreciate any assistance that you can provide.
Thanks
Your outputs are not printing twice; they're being output as named vectors. The unquoted line is the element names, and the quoted line is the output itself. You can see this by checking the length of an output:
length( sapply( sample, replace_alnum ) )
# [1] 2
So you know there are only 2 elements there.
If you want them without the names, you can unname the vector on output:
unname( sapply( sample, replace_alnum ) )
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
Alternatively, you can rename them something more to your liking:
output <- sapply( sample, replace_alnum )
names( output ) <- c( "name1", "name2" )
output
# name1 name2
# "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
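Another way to avoid the names, if you prefer not to strip them afterwards, is the USE.NAMES argument of sapply. A minimal sketch, with toupper standing in for your own function and sample_strings as a made-up input:

```r
sample_strings <- c("abc def", "ghi jkl")

# By default sapply uses a character input as the names of the result.
named <- sapply(sample_strings, toupper)

# USE.NAMES = FALSE returns a plain, unnamed vector.
unnamed <- sapply(sample_strings, toupper, USE.NAMES = FALSE)

names(named)
# [1] "abc def" "ghi jkl"
names(unnamed)
# NULL
```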
As for the regex itself, it sounds like what you want is to apply it to each word separately. If so, and if you want the strings reassembled afterwards, you need to split them on spaces, apply the function to the pieces, then recombine them at the end.
# split by space (leaving results in separate list items for recombining later)
input <- sapply( sample, strsplit, split = " " )
# apply your function on each list item separately
output <- sapply( input, replace_alnumsym )
# recombine each list item as they looked at the start
output <- sapply( output, paste, collapse = " " )
output <- unname( output )
output
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh "
And if you want to clean up the trailing white space:
output <- trimws( output )
output
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh"
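The split / apply / recombine steps can also be wrapped in one helper that drops unwanted tokens outright, which avoids the trailing space. This is only a sketch: drop_tokens and is_alnumsym are hypothetical names, and the rule in is_alnumsym (whole token of 8+ characters from [A-Za-z0-9_-] containing a lowercase letter, an uppercase letter, a digit and _ or -) is my reading of the intent, not the original regex.

```r
sample_strings <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")

# Hypothetical helper: split on spaces, drop tokens flagged by is_unwanted, re-join.
drop_tokens <- function(x, is_unwanted) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    paste(tokens[!is_unwanted(tokens)], collapse = " ")
  }, character(1))
}

# Hypothetical rule, checked on each whole token.
is_alnumsym <- function(tok) {
  grepl("^[A-Za-z0-9_-]{8,}$", tok) &
    grepl("[a-z]", tok) & grepl("[A-Z]", tok) &
    grepl("[0-9]", tok) & grepl("[_-]", tok)
}

cleaned <- drop_tokens(sample_strings, is_alnumsym)
cleaned
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh"
```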
No idea if this regex-based approach is really fine, but it is possible if we assume that:
alnumsym "words" are non-whitespace chunks delimited with whitespace and start/end of string
alnum words are chunks of letters/digits separated with non-letter/digits or start/end of string.
Then, you may use
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
gsub("\\b(?=\\w*[a-z])(?=\\w*[A-Z])(?=\\w*\\d)\\w{8,}", "", sample, perl=TRUE) ## replace_alnum
gsub("(?<!\\S)(?=\\S*[a-z])(?=\\S*[A-Z])(?=\\S*[0-9])(?=\\S*[_-])[A-Za-z0-9_-]{8,}", "", sample, perl=TRUE) ## replace_alnumsym
Pattern 1 details:
\\b - a leading word boundary (we need to match a word)
(?=\\w*[a-z]) - (a positive lookahead) after 0+ word chars (\w*) there must be a lowercase ASCII letter
(?=\\w*[A-Z]) - an uppercase ASCII letter must be inside this word
(?=\\w*\\d) - and a digit, too
\\w{8,} - if all the conditions above matched, match 8+ word chars
Note that to avoid matching _ (\w matches _) you need to replace \w with [^\W_].
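To see what the underscore note means in practice, here is a rough comparison of pattern 1 with \w and with [^\W_] substituted in (the variable names with_w and without_underscore are mine):

```r
sample_strings <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")

# \\w also matches "_", so the token with an underscore is removed as well.
with_w <- gsub("\\b(?=\\w*[a-z])(?=\\w*[A-Z])(?=\\w*\\d)\\w{8,}",
               "", sample_strings, perl = TRUE)

# [^\\W_] is "a word character that is not _", so the underscore token is kept.
without_underscore <- gsub(
  "\\b(?=[^\\W_]*[a-z])(?=[^\\W_]*[A-Z])(?=[^\\W_]*\\d)[^\\W_]{8,}",
  "", sample_strings, perl = TRUE)

with_w
# [1] "abc def ghi " "abcd efgh "
without_underscore
# [1] "abc def ghi "           "abcd efgh WQWEQtWe_232"
```

Both results keep the space that preceded the removed token; trimws() can clean that up if needed.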
Pattern 2 details:
(?<!\\S) - (a negative lookbehind) no non-whitespace can appear immediately to the left of the current location (a whitespace or start of string should be in front)
(?=\\S*[a-z]) - after 0+ non-whitespace chars, there must be a lowercase ASCII letter
(?=\\S*[A-Z]) - the non-whitespace chunk must contain an uppercase ASCII letter
(?=\\S*[0-9]) - and a digit
(?=\\S*[_-]) - and either _ or -
[A-Za-z0-9_-]{8,} - if all the conditions above matched, match 8+ ASCII letters, digits or _ or -.
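A quick way to see what the (?<!\\S) look-behind contributes is to run pattern 2 with and without it on a chunk that has a prefix attached. The "id:WQWEQtWe_232" input is invented for this sketch:

```r
x <- c("abcd efgh WQWEQtWe_232", "id:WQWEQtWe_232")

checks <- "(?=\\S*[a-z])(?=\\S*[A-Z])(?=\\S*[0-9])(?=\\S*[_-])[A-Za-z0-9_-]{8,}"

# With the look-behind, a match may only start after whitespace or at the string start.
anchored <- gsub(paste0("(?<!\\S)", checks), "", x, perl = TRUE)

# Without it, the match can start in the middle of a non-whitespace chunk.
unanchored <- gsub(checks, "", x, perl = TRUE)

anchored
# [1] "abcd efgh "      "id:WQWEQtWe_232"
unanchored
# [1] "abcd efgh " "id:"
```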
I would like to extract parts of strings. The string is:
> (x <- 'ab/cd efgh "xyz xyz"')
> [1] "ab/cd efgh \"xyz xyz\""
Now, I would like first to extract the first part:
> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"
But I don't succeed in extracting the second part:
> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""
What is wrong with this code?
Thanks for help.
Your last snippet does not work because you reinsert the whole match back into the result: (\"[A-Za-z ]+\")$ matches and captures ", 1+ letters and spaces, " into Group 1 and \1 in the replacement puts it back.
You may actually get the last part inside quotes by removing all chars other than " at the start of the string:
x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)
The sub call finds and replaces only once: it matches the start of the string (^) followed by 1+ chars other than ", using the negated character class [^"]+.
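If the goal is extraction rather than deletion, regmatches with regexpr may read more directly. A minimal sketch, assuming the quoted part contains no embedded double quotes (quoted and inner are names I made up):

```r
x <- 'ab/cd efgh "xyz xyz"'

# Extract the first double-quoted chunk, quotes included.
quoted <- regmatches(x, regexpr('"[^"]*"', x))

# Strip the quotes if only the inner text is wanted.
inner <- gsub('"', "", quoted, fixed = TRUE)

quoted
# [1] "\"xyz xyz\""
inner
# [1] "xyz xyz"
```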
To get this to work with sub, you have to match the whole string. The help file says
For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
So to get this to work with your regex, prepend the sometimes risky catch-all ".*":
sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
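One consequence of the help text quoted above is that an element with no match comes back untouched, which can look like a successful extraction. A small sketch of guarding against that with grepl (the second input string and the NA convention are my assumptions):

```r
x <- c('ab/cd efgh "xyz xyz"', "no quotes here")
pattern <- ".*(\"[A-Za-z ]+\")$"

# sub returns non-matching elements unchanged.
out <- sub(pattern, "\\1", x, perl = TRUE)

# Mark non-matching elements as NA instead of passing them through.
out_checked <- ifelse(grepl(pattern, x, perl = TRUE), out, NA_character_)

out
# [1] "\"xyz xyz\""    "no quotes here"
out_checked
# [1] "\"xyz xyz\"" NA
```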