Removing text between parentheses with unmatched pairs - r

I am trying to remove characters/numbers between parentheses. Firstly, the numbered parentheses - i.e. ("(3)") - at the start, and then anything in the second pair of parentheses. Sometimes this second pair of parentheses has an unmatched bracket which complicates things. An example:
library(qdapRegex)
n <- c("(1) Apple (Pe(ar)", "(2) Apple (Or(ang)e)", "(3) Banana (Hot(dog)")
c <- rm_between(n,"(",")", extract = TRUE)
To ideally get:
c
> "Apple" "Apple" "Banana"

It seems that you always need the second word. If that is the case, then here are a couple of (straightforward) ways of doing it:
#Base R
sapply(strsplit(n, ' '), `[`, 2)
[1] "Apple" "Apple" "Banana"
#The always fun, word() from stringr package
stringr::word(n, 2)
[1] "Apple" "Apple" "Banana"

If you want to use regex, you could replace every match of this pattern with an empty string:
[^A-Za-z ]
Or with the case-insensitive flag:
(?i)[^a-z ]
Regex demo
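For the unmatched-bracket case specifically, a minimal base R sketch is shown here. It assumes that every string starts with a numbered group such as "(1)" and that everything from the next opening parenthesis onward can be discarded; the variable names stripped and cleaned are only illustrative.

```r
n <- c("(1) Apple (Pe(ar)", "(2) Apple (Or(ang)e)", "(3) Banana (Hot(dog)")

# Step 1: drop the leading numbered group, e.g. "(1) "
stripped <- sub("^\\([0-9]+\\)[[:space:]]*", "", n)

# Step 2: drop everything from the first remaining "(" to the end,
# so it does not matter whether the brackets inside are balanced
cleaned <- sub("[[:space:]]*\\(.*$", "", stripped)

cleaned
```

Because the second step never tries to pair brackets, an unmatched "(" or ")" inside the second group does not affect the result.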

Related

R: using \\b and \\B in regex

I read about regex and came across word boundaries. I found a question that is about the difference between \b and \B. Using the code from this question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
You can use regexpr and regmatches to get the match. grep only tells you which elements contain a match (with value = TRUE it returns those whole elements). You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
grep returns the whole string because it only checks whether the pattern occurs in the string. If you want to extract cat, you need to use other functions such as str_extract from the stringr package:
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference between \\b and \\B is that \\b marks a word boundary whereas \\B is its negation. That is, \\bcat\\b matches only if cat stands as a whole word (with a non-word character or the start/end of the string on each side), whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
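The distinction between testing for a match and extracting it can be sketched in base R as follows. The vector x and the result names are made up for illustration.

```r
x <- c("The cat sat", "scattered", "education")

# grep/grepl only report WHICH elements match
idx_word   <- grep("\\bcat\\b", x)    # positions of elements with "cat" as a whole word
has_inside <- grepl("\\Bcat\\B", x)   # TRUE where "cat" sits inside a longer word

# regmatches + regexpr return the matched text itself
# (only for the elements that actually matched)
extracted <- regmatches(x, regexpr("\\bcat\\b", x))

idx_word
has_inside
extracted
```

Here grep points at the first element only, grepl flags the second and third, and regmatches returns the single matched substring rather than the whole string.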

Issue with gsub and str_replace_all in R

I need to replace 38 different types of expressions in the following format "IDENTIFIER:ABC:DEF", "IDENTIFIER:GHI:JKL", etc. with plain strings like "apple" and "banana". I've tried using str_replace_all in the following format:
df$column <- df$column %>% str_replace_all("IDENTIFIER:ABC:DEF", "apple")
df$column <- df$column %>% str_replace_all("IDENTIFIER:GHI:JKL", "banana")
However, for some reason, R only processes about half of my requests. I've checked and double checked for errors and tried to break up the code but no success.
So then I tried the same with gsub:
df$column <- gsub(df$column, "IDENTIFIER:ABC:DEF", "apple")
df$column <- gsub(df$column, "IDENTIFIER:GHI:JKL", "banana")
and I get this error: "In gsub(df$column ...): argument 'pattern' has length > 1 and only the first element will be used".
I'm really not sure what to do next. Any advice?
gsubfn in the package of the same name provides a superset of gsub functionality and in particular it can optionally take a list as a replacement instead of a string. For each match to the regular expression if the match equals one of the list names it is replaced with the corresponding list value.
library(gsubfn)
x <- "xyz IDENTIFIER:ABC:DEF abc IDENTIFIER:GHI:JKL def" # test input
L <- list("IDENTIFIER:ABC:DEF" = "apple", "IDENTIFIER:GHI:JKL" = "banana")
gsubfn("\\y\\S+\\y", L, x)
## [1] "xyz apple abc banana def"
This also works:
gsubfn("\\b\\S+\\b", L, x, perl = TRUE)
## [1] "xyz apple abc banana def"
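The warning in the question comes from the argument order: base R's gsub expects gsub(pattern, replacement, x), so passing df$column first makes the whole column the pattern. A minimal base R sketch with the arguments in the right order follows; the lookup vector repl and the sample vector column are invented for the example, and fixed = TRUE is used on the assumption that the identifiers should be matched literally.

```r
# Named vector: names are the literal strings to find, values are the replacements
repl <- c("IDENTIFIER:ABC:DEF" = "apple",
          "IDENTIFIER:GHI:JKL" = "banana")

column <- c("IDENTIFIER:ABC:DEF", "x IDENTIFIER:GHI:JKL y", "other")

# Apply each replacement in turn; pattern first, replacement second, data third
for (pat in names(repl)) {
  column <- gsub(pat, repl[[pat]], column, fixed = TRUE)
}

column
```

With 38 pairs, only the repl vector grows; the loop stays the same.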

Substituting lone characters or chars separated by comma

I have the following data and am looking to substitute only single characters.
A,Apple
A
I want to produce the output such that
Banana,Apple
Banana
In other words I want to substitute anything that has an A, or just an A with banana. But if another word starting with A comes in, I want to ignore that.
I tried
gsub("A", "Banana"),
gsub("A[^,;]","Banana"),
But this won't work for the first example; the output I get is
Banana,Bpple
Any ideas on how I can achieve this?
Thanks!
If the value is always surrounded by punctuation or line start/end:
text = "A,Apple\nA\nAvocado"
text2 = gsub("\\bA\\b", "Banana", text, perl = TRUE)
cat(text2)
\b is a zero-width word boundary, so the pattern matches an A only when it stands alone as a whole word; Apple and Avocado are left untouched. Because \b consumes no characters, no capture groups or backreferences are needed to preserve the surrounding punctuation.
Output:
Banana,Apple
Banana
Avocado
A non-regex solution could be to split the string on comma (,), change the value to "Banana" if it is equal to "A"
sapply(strsplit(x, ","), function(x) toString(ifelse(x == "A","banana", x)))
#[1] "banana, Apple" "banana"
data
x <- c("A,Apple", "A")
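If only commas (and the start or end of the string) should count as delimiters, rather than any word boundary, a lookahead-based sketch is one option. This assumes PCRE via perl = TRUE; the extra test values in y are made up to show the edge cases.

```r
y <- c("A,Apple", "A", "Apple,A", "A-B", "A,A")

# (^|,)   : the A must be at the start or directly after a comma (kept via \\1)
# (?=,|$) : and must be followed by a comma or the end, without consuming it
out <- gsub("(^|,)A(?=,|$)", "\\1Banana", y, perl = TRUE)

out
```

Unlike the \b version, this leaves "A-B" alone, because a hyphen is not treated as a field separator.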

Using Regex to edit a column in R [duplicate]

I've got a column people$food that has entries like chocolate or apple-orange-strawberry.
I want to split people$food by - and get the first entry from the split.
In python, the solution would be food.split('-')[0], but I can't find an equivalent for R.
If you need to extract the first (or nth) entry from each split, use:
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word,"-"), `[`, 1)
#[1] "apple" "chocolate"
Or faster and more explicitly:
vapply(strsplit(word,"-"), `[`, 1, FUN.VALUE=character(1))
#[1] "apple" "chocolate"
Both bits of code will cope well with selecting whichever value in the split list, and will deal with cases that are outside the range:
vapply(strsplit(word,"-"), `[`, 2, FUN.VALUE=character(1))
#[1] "orange" NA
For example
word <- 'apple-orange-strawberry'
strsplit(word, "-")[[1]][1]
[1] "apple"
or, equivalently
unlist(strsplit(word, "-"))[1]
Essentially, strsplit returns a list, whose elements have to be accessed either by indexing into the list (the former case) or by unlisting (the latter).
If you want to apply the method to an entire column:
first.word <- function(my.string){
  unlist(strsplit(my.string, "-"))[1]
}
words <- c('apple-orange-strawberry', 'orange-juice')
sapply(words, first.word)
apple-orange-strawberry orange-juice
"apple" "orange"
I would use sub() instead. Since you want the first "word" before the split, we can simply remove everything after the first - and that's what we're left with.
sub("-.*", "", people$food)
Here's an example -
x <- c("apple", "banana-raspberry-cherry", "orange-berry", "tomato-apple")
sub("-.*", "", x)
# [1] "apple" "banana" "orange" "tomato"
Otherwise, if you want to use strsplit() you can round up the first elements with vapply()
vapply(strsplit(x, "-", fixed = TRUE), "[", "", 1)
# [1] "apple" "banana" "orange" "tomato"
I would suggest using head rather than [ in R.
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word, "-"), head, 1)
# [1] "apple" "chocolate"
dplyr/magrittr approach:
library(magrittr)
library(dplyr)
word = c('apple-orange-strawberry', 'chocolate')
strsplit(word, "-") %>% sapply(extract2, 1)
# [1] "apple" "chocolate"
Using str_remove() to delete everything after the pattern:
df <- data.frame(words = c('apple-orange-strawberry', 'chocolate'))
mutate(df, short = stringr::str_remove(words, "-.*")) # mutate method
stringr::str_remove(df$words, "-.*") # str_remove example
stringr::str_replace(df$words, "-.*", "") # str_replace method
stringr::str_split_fixed(df$words, "-", n=2)[,1] # str_split method similar to original question's Python code
tidyr::separate(df, words, into = c("short", NA)) # using the separate function
words short
1 apple-orange-strawberry apple
2 chocolate chocolate
stringr 1.5.0 introduced str_split_i to do this easily:
library(stringr)
str_split_i(c('apple-orange-strawberry','chocolate'), "-", 1)
[1] "apple" "chocolate"
The third argument represents the index you want to extract. Also new is that you can use negative values to index from the right:
str_split_i(c('apple-orange-strawberry','chocolate'), "-", -1)
[1] "strawberry" "chocolate"
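For readers without stringr 1.5.0, the same first and last pieces can be obtained in base R. This is only a sketch; the names food, first_piece, and last_piece are illustrative.

```r
food <- c("apple-orange-strawberry", "chocolate")

# First piece: remove everything from the first "-" onward
first_piece <- sub("-.*", "", food)

# Last piece: the greedy ".*" runs up to the LAST "-", so only the tail remains
last_piece <- sub(".*-", "", food)

first_piece
last_piece
```

Strings without a hyphen are returned unchanged in both cases, which matches the str_split_i output above.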

How to delete all strings except some specific letters in R?

After researching for a while, I didn't find exactly what I was looking for.
What I'd like to do is to keep an exact pattern in a string.
So this is my example:
text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS")
how to get exactly "THIS" in all strings:
res=c("THIS","THIS","THIS","","")
I tried gsub in R, but I don't know how to match characters.
For example I tried:
gsub("(THIS).*", "\\1", text) # This deletes everything after "THIS".
gsub(".*(THIS)", "\\1", text) # This deletes everything before "THIS".
To extract THIS or THAT as whole words, you may use the following regex:
\b(THIS|THAT)\b
where \b is a word boundary and (...|...) is a capturing group with | alternation operator (that can appear more than once, more alternatives can be added).
Since regmatches with gregexpr return a list of vectors with some empty entries whenever no match is found, you need to convert them into NA first, then unlist, and then turn to "".
Here is some base R code:
text <- c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS", "THAT is something I need, too")
matches <- regmatches(text, gregexpr("\\b(THIS|THAT)\\b", text))
res <- lapply(matches, function(x) if (length(x) == 0) NA else x)
res[is.na(res)] <- ""
unlist(res)
# [1] "THIS" "THIS" "THIS" "" "" "THAT"
We can use str_extract
library(stringr)
str_extract(text, "THIS")
#[1] "THIS" "THIS" "THIS" NA NA
It is better to have NA rather than ""
This will first delete elements which don't match THIS and then follows your original idea while storing intermediate result to a variable. It seems that you want to have empty strings for elements that do not match, and last line does that.
tmp <- text[grepl("THIS", text)]
gsub("(THIS).*", "\\1", tmp) -> tmp
gsub(".*(THIS)", "\\1", tmp) -> tmp
c(tmp, rep("", length(text) - length(tmp)))
gsub("[^THIS]","",text) might look like it does the trick, but it doesn't work as expected (see comment). "[^THIS]" is a character class that matches any single character other than T, H, I, or S, not everything except the word THIS, so stray capital letters such as the T in "not exactly This" survive the replacement.
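When at most one match per string is expected, a shorter base R sketch is to pre-fill the result with empty strings and overwrite only the positions that matched. It uses the original five-element vector from the question; the names m, hit, and res are illustrative.

```r
text <- c("hello, please keep THIS", "THIS is important",
          "all THIS should be done", "not exactly This", "not THHIS")

# regexpr returns -1 for elements without a match
m <- regexpr("\\bTHIS\\b", text)
hit <- as.vector(m) > 0

# Start with "" everywhere, then fill in the matched text where there was a hit
res <- rep("", length(text))
res[hit] <- regmatches(text, m)

res
```

This keeps the result aligned with the input, so non-matching elements stay in their original positions instead of being appended at the end.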

Resources