I am trying to do a string substitution on a dataframe column in R. I need to find all the words preceded by '#' (without a space, e.g. #word) and change the '#' to '!' (e.g. from #word to !word). At the same time, the other instances of '#' (e.g. # or ## or #[#]) should be left intact. For example, this is my original dataframe (to change: #def, #jkl, #stu):
> df = data.frame(number = 1:4, text = c('abc #def ghi', '#jkl # mno', '#[#] pqr #stu', 'vwx ### yz'))
> df
number text
1 1 abc #def ghi
2 2 #jkl # mno
3 3 #[#] pqr #stu
4 4 vwx ### yz
And this is what I need it to look like:
> df_result = data.frame(number = 1:4, text = c('abc !def ghi', '!jkl # mno', '#[#] pqr !stu', 'vwx ### yz'))
> df_result
number text
1 1 abc !def ghi
2 2 !jkl # mno
3 3 #[#] pqr !stu
4 4 vwx ### yz
I have tried with
> gsub('#.+[a-z] ', '!', df$text)
[1] "abc !ghi" "!# mno" "!#stu" "vwx ### yz"
But the result is not the desired one. Any help is much appreciated.
Thank you.
How about
gsub("(^| )#(\\w)", "\\1!\\2", df$text)
# [1] "abc !def ghi" "!jkl # mno" "#[#] pqr !stu" "vwx ### yz"
This matches a # symbol at the beginning of a string, or after a space. Then, we capture the word character after the # symbol, and replace # with !.
Explanation courtesy of regex101.com:
(^| ) is the 1st Capturing Group; ^ asserts position at start of the string; | denotes "or"; blank space matches the space character literally
# matches the character # literally (case sensitive)
(\\w) is the 2nd Capturing Group, it denotes a word character
The replacement string \\1!\\2 replaces the regular expression match with the first capturing group (\\1), followed by !, followed by the second capturing group (\\2).
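As a minimal sketch of writing the result back into the dataframe (reusing the question's df; stringsAsFactors = FALSE is only added so the column is character on older R versions):

```r
df <- data.frame(number = 1:4,
                 text = c('abc #def ghi', '#jkl # mno', '#[#] pqr #stu', 'vwx ### yz'),
                 stringsAsFactors = FALSE)

# Replace '#' with '!' only when it starts the string or follows a space
# and is immediately followed by a word character.
df$text <- gsub("(^| )#(\\w)", "\\1!\\2", df$text)
df$text
# [1] "abc !def ghi"  "!jkl # mno"    "#[#] pqr !stu" "vwx ### yz"
```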
You can use a positive lookahead (?=...)
gsub("#(?=[A-Za-z])", "!", df$text, perl = TRUE)
[1] "abc !def ghi" "!jkl # mno" "#[#] pqr !stu" "vwx ### yz"
From the "Regular Expressions as used in R" documentation page:
Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed.
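The two patterns in this thread agree on the question's data but are not equivalent. A small sketch on made-up inputs (the vector x below is hypothetical, not from the question) shows where they differ: the capture-group version requires start-of-string or a space before the #, while the lookahead version only checks that a letter follows.

```r
# Hypothetical inputs chosen to show where the two patterns disagree.
x <- c("a#b", "#x1", "#1abc")

# Capture-group version: '#' must be at the start or after a space,
# and may be followed by any word character (letters, digits, underscore).
by_groups <- gsub("(^| )#(\\w)", "\\1!\\2", x)

# Lookahead version: '#' may appear anywhere, but must be followed by an ASCII letter.
by_lookahead <- gsub("#(?=[A-Za-z])", "!", x, perl = TRUE)

by_groups
# [1] "a#b"   "!x1"   "!1abc"
by_lookahead
# [1] "a!b"   "!x1"   "#1abc"
```

Which behaviour is right depends on whether something like a#b or #1abc should count as a "word preceded by #" in your data.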
How can I drop "-" or double "--" only at the beginning of the value in the text column?
df <- data.frame (x = c(12,14,15,178),
text = c("--Car","-Transport","Big-Truck","--Plane"))
x text
1 12 --Car
2 14 -Transport
3 15 Big-Truck
4 178 --Plane
Expected output:
x text
1 12 Car
2 14 Transport
3 15 Big-Truck
4 178 Plane
You can use gsub with the regex "^\\-+". The ^ anchors the match at the beginning of the string, and \\-+ matches one or more (+) hyphens (\\-).
gsub("^\\-+", "", df$text)
# [1] "Car" "Transport" "Big-Truck" "Plane"
If there may also be spaces at the beginning of the string and you want to remove them too, you can use [ -]+ in your regex. It matches any run of spaces and hyphens at the beginning of the string.
gsub("^[ -]+", "", df$text)
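The question's data has no leading spaces, so here is a quick sketch on a made-up vector (x is hypothetical) to show what the [ -]+ variant does:

```r
# Hypothetical values with leading spaces mixed in with the hyphens.
x <- c("  --Car", "- Transport", "Big-Truck")

# Strip any leading run of spaces and hyphens; hyphens inside the value are kept.
gsub("^[ -]+", "", x)
# [1] "Car"       "Transport" "Big-Truck"
```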
To apply this to the dataframe, just do this. In tidyverse, you can also use str_remove:
df$text <- gsub("^\\-+", "", df$text)
# or, in dplyr
library(tidyverse)
df %>%
mutate(text1 = gsub("^\\-+", "", text),
text2 = str_remove(text, "^\\-+"))
You could use trimws to remove certain leading/trailing characters.
trimws(df$text, whitespace = '[ -]')
# [1] "Car" "Transport" "Big-Truck" "Plane"
# a more complex situation
x <- " -- Car - -"
trimws(x, whitespace = '[ -]')
# [1] "Car"
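Note that trimws trims both ends by default. Since the question only asks about the beginning of the value, a sketch with which = "left" may be closer to what is wanted (the whitespace argument needs R 3.6.0 or later):

```r
x <- " -- Car - -"

# Trim spaces and hyphens from the left side only; the trailing " - -" is kept.
left_only <- trimws(x, which = "left", whitespace = "[ -]")
left_only
# [1] "Car - -"
```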
I have transcriptions with erroneous encodings, that is, characters that occur but should not occur.
In this toy data, the only allowed characters are this class:
"[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"
df <- data.frame(
Utterance = c("~°maybe you (.) >should ¥just¥<",
"SOME text |<-- pipe¿ and€", # <--: | and €
"blah%", # <--: %
"text ^more text", # <--: ^
"£norm(hh)a::l£mal, (1.22)"))
What I need to do is:
detect Utterances that contain any wrong encodings
extract the wrong characters
I'm doing OK as far as detection is concerned but the extraction fails miserably:
library(stringr)
library(dplyr)
df %>%
filter(!str_detect(Utterance, "[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
mutate(WrongChar = str_extract_all(Utterance, "[^)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ SO, ME, t, ex, |<, --, p, ip, e¿, a, nd
2 blah% bl, ah
3 text ^more text te, xt, ^m, or, t, ex
How can the extraction be improved to obtain this expected result:
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ |, €
2 blah% %
3 text ^more text ^
You need to
Ensure the [ and ] are escaped inside a character class
Add whitespace pattern to both regexp checks as its absence is messing your results.
So you need to use
df %>%
filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
mutate(WrongChar = str_extract_all(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
Output:
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ |, €
2 blah% %
3 text ^more text ^
Note that I used positive logic in filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")), so we get all items that contain at least one char other than an allowed one.
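If you would rather avoid the stringr dependency, the same negated-class idea can be sketched in base R with grepl and regmatches. The inputs and the ASCII-only allowed set below are simplifications I made up for illustration; add your own special characters to the class as needed.

```r
# Negated class: anything that is NOT whitespace, a bracket, or an allowed character.
# ASCII-only subset of the allowed set, assumed here for illustration.
not_allowed <- "[^\\s)(/\\]\\[A-Za-z0-9!.,:?~<>=_-]"

# Hypothetical utterances.
x <- c("fine (text) here", "bad%char", "two ^ wrong | ones")

# Detection: TRUE where at least one disallowed character occurs.
has_wrong <- grepl(not_allowed, x, perl = TRUE)

# Extraction: one character vector of offending characters per utterance.
wrong <- regmatches(x, gregexpr(not_allowed, x, perl = TRUE))

has_wrong
# [1] FALSE  TRUE  TRUE
wrong[has_wrong]
# [[1]]
# [1] "%"
#
# [[2]]
# [1] "^" "|"
```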
I have a single long string variable with 3 observations. I was trying to create a field prob to extract a specific string from the long string. The code and warning message are below.
data aa: "The probability of being a carrier is 0.0002422359 " " an BRCA1 carrier 0.0001061067 "
" an BRCA2 carrier 0.00013612 "
aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))
Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list
Here is my previous answer, updated to reflect a data.frame.
library(dplyr)
aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))
aa %>%
mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa),
gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
# aa prob
# 1 ... NA
# 2 ... NA
# 3 The probability of being a carrier is 0.0002422359 0.0002422359
# 4 an BRCA1 carrier 0.0001061067 0.0001061067
# 5 an BRCA2 carrier 0.00013612 0.0001361200
# 6 ... NA
Regex walk-through:
^ and $ are beginning and end of string, respectively; \\b is a word-boundary; none of these "consume" any characters, they just mark beginnings and endings
. means any one character
? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group
\\s is blank space, including spaces and tabs
[0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] are all letters, [0-9A-F] are hexadecimal digits, etc
(...) is a saved group; it's not uncommon in a group to use | as an "or"; this group is used later in the replacement= part of gsub as numbered groups, so \\1 recalls the first group from the pattern
So grouped and summarized:
"^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
1       ^^^^^^^^^^^^^^^^^^
2    ^^^
3 ^^^
4                         ^^^^
1. This is the "number" part, that allows for one or more digits, an optional decimal point, and zero or more digits. This is saved in group "1".
2. The word boundary guarantees that we include leading numbers (it's possible, depending on a few things, for "12.345" to be parsed as "2.345" without this).
3. Anything before the number-like string.
4. Some or no blank space after the number.
Grouped logically, in an organized way
Regex isn't unique to R, it's a parsing language that R (and most other programming languages) supports in one way or another.
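To see the pieces of the walk-through working together, here is a small sketch on a plain character vector (the third element is a made-up input). I pass perl = TRUE here so the lazy .*? follows PCRE rules; that flag is my choice, the answer above calls gsub without it.

```r
x <- c("The probability of being a carrier is 0.0002422359 ",
       " an BRCA1 carrier 0.0001061067 ",
       "abc 12.345")   # hypothetical extra input

# Keep only group 1 (the trailing number), then convert to numeric.
num_chr <- gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", x, perl = TRUE)
num <- as.numeric(num_chr)

num_chr
# [1] "0.0002422359" "0.0001061067" "12.345"
```

The "1" in BRCA1 is skipped because there is no word boundary between "A" and "1", and nothing after it would satisfy the end-of-string anchor anyway.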
This is my first post on Stack Overflow and I'll try to explain my problem as succinctly as possible.
The problem is pretty simple. I'm trying to identify strings containing alphanumeric characters and alphanumeric characters with symbols and remove them. I looked at previous questions in Stack overflow and found a solution that looks good.
https://stackoverflow.com/a/21456918/7467476
I tried the provided regex (slightly modified) in Notepad++ on some sample data just to see if it's working (and yes, it works). Then, I proceeded to use the same regex in R and use gsub to replace the string with "" (code given below).
replace_alnumsym <- function(x) {
return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[_-])[A-Za-z0-9_-]{8,}", "", x, perl = T))
}
replace_alnum <- function(x) {
return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{8,}", "", x, perl = T))
}
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
output1 <- sapply(sample, replace_alnum)
output2 <- sapply(sample, replace_alnumsym)
The code runs fine but the output still contains the strings. It hasn't been removed. I'm not getting any errors when I run the code (output below). The output format is also strange. Each element is printed twice (once without and once within quotes).
> output1
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh WQWEQtWe_232
"abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
> output2
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh WQWEQtWe_232
"abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
The desired result would be:
> output1
abc def ghi abcd efgh WQWEQtWe_232
> output2
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh
I think I'm probably overlooking something very obvious.
Appreciate any assistance that you can provide.
Thanks
Your outputs are not printing twice, they're being output as named vectors. The unquoted line is the element names, the quoted line is the output itself. You can see this by checking the length of an output:
length( sapply( sample, replace_alnum ) )
# [1] 2
So you know there are only 2 elements there.
If you want them without the names, you can unname the vector on output:
unname( sapply( sample, replace_alnum ) )
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
Alternatively, you can rename them something more to your liking:
output <- sapply( sample, replace_alnum )
names( output ) <- c( "name1", "name2" )
output
# name1 name2
# "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
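Two other ways to avoid the names in the first place, sketched with a stand-in function (strip_spaces and sample_strings are hypothetical names, not from the question): sapply has a USE.NAMES argument, and gsub is already vectorised, so sapply is not needed at all for a function that just wraps gsub.

```r
sample_strings <- c("abc def", "ghi jkl")

# Hypothetical stand-in for a gsub-based cleaning function.
strip_spaces <- function(x) gsub(" ", "", x)

named   <- sapply(sample_strings, strip_spaces)                     # names = the inputs
unnamed <- sapply(sample_strings, strip_spaces, USE.NAMES = FALSE)  # no names attached
direct  <- strip_spaces(sample_strings)                             # gsub is vectorised

names(named)
# [1] "abc def" "ghi jkl"
unnamed
# [1] "abcdef" "ghijkl"
```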
As far as the regex itself, it sounds like what you want is to apply it to each string separately. If so, and if you want them back to where they were at the end, you need to split them by space, then recombine them at the end.
# split by space (leaving results in separate list items for recombining later)
input <- sapply( sample, strsplit, split = " " )
# apply your function on each list item separately
output <- sapply( input, replace_alnumsym )
# recombine each list item as they looked at the start
output <- sapply( output, paste, collapse = " " )
output <- unname( output )
output
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh "
And if you want to clean up the trailing white space:
output <- trimws( output )
output
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh"
No idea if this regex-based approach is really fine, but it is possible if we assume that:
alnumsym "words" are non-whitespace chunks delimited with whitespace and start/end of string
alnum words are chunks of letters/digits separated with non-letter/digits or start/end of string.
Then, you may use
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
gsub("\\b(?=\\w*[a-z])(?=\\w*[A-Z])(?=\\w*\\d)\\w{8,}", "", sample, perl=TRUE) ## replace_alnum
gsub("(?<!\\S)(?=\\S*[a-z])(?=\\S*[A-Z])(?=\\S*[0-9])(?=\\S*[_-])[A-Za-z0-9_-]{8,}", "", sample, perl=TRUE) ## replace_alnumsym
See the R demo online.
Pattern 1 details:
\\b - a leading word boundary (we need to match a word)
(?=\\w*[a-z]) - (a positive lookahead) after 0+ word chars (\w*) there must be a lowercase ASCII letter
(?=\\w*[A-Z]) - an uppercase ASCII letter must be inside this word
(?=\\w*\\d) - and a digit, too
\\w{8,} - if all the conditions above matched, match 8+ word chars
Note that to avoid matching _ (\w matches _) you need to replace \w with [^\W_].
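A sketch of that [^\W_] substitution, assuming the goal is the question's desired output1 (remove only the purely alphanumeric word and leave WQWEQtWe_232 alone); the trimws call at the end is just my cleanup of the space left behind:

```r
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")

# Same as pattern 1, but every \w is replaced with [^\W_]
# (a word character that is not an underscore).
alnum_only <- "\\b(?=[^\\W_]*[a-z])(?=[^\\W_]*[A-Z])(?=[^\\W_]*\\d)[^\\W_]{8,}"

res <- trimws(gsub(alnum_only, "", sample, perl = TRUE))
res
# [1] "abc def ghi"            "abcd efgh WQWEQtWe_232"
```

In the second string the digit lookahead cannot get past the underscore, and there is no word boundary inside WQWEQtWe_232 to start a later match from, so the whole chunk is kept.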
Pattern 2 details:
(?<!\\S) - (a negative lookbehind) no non-whitespace can appear immediately to the left of the current location (a whitespace or start of string should be in front)
(?=\\S*[a-z]) - after 0+ non-whitespace chars, there must be a lowercase ASCII letter
(?=\\S*[A-Z]) - the non-whitespace chunk must contain an uppercase ASCII letter
(?=\\S*[0-9]) - and a digit
(?=\\S*[_-]) - and either _ or -
[A-Za-z0-9_-]{8,} - if all the conditions above matched, match 8+ ASCII letters, digits or _ or -.
I would like to extract cat and dog in any order
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
What I have now extracts cat and dog, but also the text in-between
stringr::str_extract(string1, "cat.*dog|dog.*cat")
I would like the output to be
cat dog
and
dog cat
for string1 and string2, respectively
You may use sub with the following PCRE regex:
.*(?|(dog).*(cat)|(cat).*(dog)).*
See the regex demo.
Details
.* - any 0+ chars other than line break chars (to match all chars add (?s) at the pattern start)
(?|(dog).*(cat)|(cat).*(dog)) - a branch reset group (?|...|...) matching either of the two alternatives:
(dog).*(cat) - Group 1 capturing dog, then any 0+ chars as many as possible, and Group 2 capturing cat
| - or
(cat).*(dog) - Group 1 capturing cat, then any 0+ chars as many as possible, and Group 2 capturing dog (in a branch reset group, group IDs reset to the value before the group + 1)
.* - any 0+ chars other than line break chars
The \1 \2 replacement pattern inserts Group 1 and Group 2 values into the resulting string (so that the result is just dog or cat, a space, and a cat or dog).
See an R demo online, too:
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")
sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl=TRUE)
## => [1] "cat dog" "dog cat"
To return NA in case of no match, use a regex to either match the specific pattern, or the whole string, and use it with gsubfn to apply custom replacement logic:
> gsubfn("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "NA" "NA"
> gsubfn("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "cat dog" "dog cat"
Here,
^ - start of the string anchor
(?:.*((dog).*(cat)|(cat).*(dog)).*|.*) - a non-capturing group that matches either of the two alternatives:
.*((dog).*(cat)|(cat).*(dog)).*:
.* - any 0+ chars as many as possible
((dog).*(cat)|(cat).*(dog)) - a capturing group matching either of the two alternatives:
(dog).*(cat) - dog (Group 2, assigned to the a variable), any 0+ chars as many as possible, and then cat (Group 3, assigned to the b variable)
|
(cat).*(dog) - cat (Group 4, assigned to the y variable), any 0+ chars as many as possible, and then dog (Group 5, assigned to the z variable)
.* - any 0+ chars as many as possible
| - or
.* - any 0+ chars
$ - end of the string anchor.
The x in the anonymous function represents the Group 1 value that is "technical" here, we check if the Group 1 match length is not zero with nchar, and if it is not empty we replace with the custom logic, and if the Group 1 is empty, we replace with NA.
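The quoted "NA" in the output above suggests the result holds the string "NA" rather than a missing value. If a real NA is needed and you would rather stay in base R, one possible sketch is to reuse the branch reset pattern with grepl and sub (the third input is made up to show the no-match case):

```r
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "no animals here")   # hypothetical non-matching input

pattern <- ".*(?|(dog).*(cat)|(cat).*(dog)).*"

# Where the pattern matches, keep the two captured words; otherwise return a true NA.
res <- ifelse(grepl(pattern, x, perl = TRUE),
              sub(pattern, "\\1 \\2", x, perl = TRUE),
              NA_character_)
res
# [1] "cat dog" "dog cat" NA
```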
We can use str_extract_all from the stringr package with capture groups.
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"
library(stringr)
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)")
# [[1]]
# [1] "cat" "dog"
#
# [[2]]
# [1] "dog" "cat"
#
# [[3]]
# character(0)
We can also set simplify = TRUE. The output would be a matrix.
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)", simplify = TRUE)
# [,1] [,2]
# [1,] "cat" "dog"
# [2,] "dog" "cat"
# [3,] "" ""
Or,
> regmatches(string1,gregexpr("cat|dog",string1))
[[1]]
[1] "cat" "dog"
> regmatches(string2,gregexpr("cat|dog",string2))
[[1]]
[1] "dog" "cat"
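To get from these lists of matches to the single strings the question asks for, one possible sketch is to paste each element together (string3 is the no-match example from the answer above; returning NA for it is my assumption about the desired behaviour):

```r
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"
x <- c(string1, string2, string3)

# One character vector of matches per input string, in order of appearance.
m <- regmatches(x, gregexpr("cat|dog", x))

# Collapse each set of matches into one string; NA when nothing matched.
res <- vapply(m,
              function(v) if (length(v)) paste(v, collapse = " ") else NA_character_,
              character(1))
res
# [1] "cat dog" "dog cat" NA
```

Unlike the sub approach, this keeps every occurrence, so a string containing cat twice would give "cat cat dog" rather than two words.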