gsub function, exact match of the pattern

I have a list of words stored in a data frame called remove, and I want to remove every element of text that exactly matches one of those words.
remove <- data.frame(word = c("the", "a", "she"))
text <- c("she", "he", "a", "the", "aaaa")
for (i in 1:3) {
  text <- gsub(remove[i, 1], "", text)
}
This returns
#[1] "" "he" "" "" ""
However, what I am expecting is
#[1] "" "he" "" "" "aaaa"
I also tried the following code, but it does not return the expected result either:
for (i in 1:3) {
  text <- gsub("^remove[i, 1]$", "", text)
}
Thanks so much for your help.

For an exact match, use value matching (%in%) instead of a regular expression:
remove <- c("the", "a", "she") # I made remove a vector too
replace(text, text %in% remove, "")
#[1] "" "he" "" "" "aaaa"
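If gsub() is still preferred, note that the second attempt in the question fails because the pattern is a quoted literal: R does not evaluate an expression placed inside a string. A minimal sketch, assuming the words contain no regex metacharacters, builds an anchored pattern with paste0() so that only whole elements match:

```r
remove <- c("the", "a", "she")
text <- c("she", "he", "a", "the", "aaaa")

# Build one anchored alternation: ^(the|a|she)$
# Assumption: none of the words contains a regex metacharacter.
pattern <- paste0("^(", paste(remove, collapse = "|"), ")$")

# Only elements that match a word from start to end are blanked.
result <- gsub(pattern, "", text)
print(result)
```

Because of the ^ and $ anchors, "aaaa" is left alone even though it contains "a".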

A simple base R solution, which drops the matching elements instead of replacing them with "", is:
text[!text %in% as.vector(unlist(remove, use.names = FALSE))]

Related

How can I split a string "a\n\nb" into c("a", "", "", "b")?

When I use stringr::str_split to split "\n\na\n" by "\n", I obtain c("", "", "a", "").
I expected to obtain c("a", "", "", "b") when splitting "a\n\nb" by "\n", but I obtained c("a", "", "b") instead. How can I obtain c("a", "", "", "b") by splitting "a\n\nb"?
Try:
stringr::str_split("a\n\nb", "\n")
Expect:
c("a", "", "", "b")
Result:
c("a", "", "b")

Assuming that we want to extract each character separately, split on the empty string and then remove every \n, so that the newline elements become empty strings:
str_remove_all(strsplit(str1, "")[[1]], "\n")
[1] "a" "" "" "b"
Or use perl = TRUE with \n?
strsplit(str1, "\n?", perl = TRUE)[[1]]
[1] "a" "" "" "b"
If it should be at each \n only
strsplit(str_replace(str1, "(\n\\S)", "\n\\1"), "\n")[[1]]
[1] "a" "" "" "b"
data
str1 <- "a\n\nb"

You are getting the correct result, even if it's not what you expect. The length of the result equals the number of delimiters in the string + 1. Thus if you want an extra field, you need to add an extra delimiter:
x1 <- "a\n\nb"
x2 <- "abc\n\ndef"
strsplit(gsub("(\n{2,})", "\\1\n", x1), "\n")[[1]]
[1] "a" "" "" "b"
strsplit(gsub("(\n{2,})", "\\1\n", x2), "\n")[[1]]
[1] "abc" "" "" "def"
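The "number of delimiters + 1" rule can be checked directly. The sketch below uses base R only; count_delims() is a hypothetical helper name introduced here for illustration. Note that base strsplit() drops a trailing empty field, so the rule holds for it only when the string does not end with the delimiter.

```r
# Count the "\n" delimiters in a string (hypothetical helper, base R only).
count_delims <- function(x) {
  lengths(regmatches(x, gregexpr("\n", x, fixed = TRUE)))
}

x <- "a\n\nb"
fields <- strsplit(x, "\n", fixed = TRUE)[[1]]

n_delims <- count_delims(x)  # two "\n" characters
n_fields <- length(fields)   # three fields: "a", "", "b"
print(c(delimiters = n_delims, fields = n_fields))
```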

How to replace certain rows in a data frame without using a for loop

I'd like to replace certain rows of my variable group with blank. How can I do it without using a for loop? Here is my code. The goal is to code 'group' to be: "A", blank, blank, blank, blank, "A", blank, blank...
group <- rep("A", 20)
var <- rep("B", 20)
out <- data.frame(group, var)
out$row_num <- seq(1:nrow(out))
for (i in 1:nrow(out)) {
if (out$row_num[i] %% 5 != 1) {
out$group[i] <- " "
}
}

These operations are vectorized. So, the replacement can be done without a for loop. Also, from R 4.0, the default option while constructing data.frame is stringsAsFactors = FALSE
out$group[out$row_num %% 5 != 1] <- ' '
Based on the update, if the intention is to replicate
rep(c("A", "", "", "", ""), length.out = 20)
#[1] "A" "" "" "" "" "A" "" "" "" "" "A" "" "" "" "" "A" "" "" "" ""
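As a self-contained sketch of the vectorized approach, the snippet below rebuilds the data frame from the question and blanks the rows through a logical mask. The name blank_rows is introduced here for illustration, and "" is used as the blank on the assumption that an empty string is what is wanted.

```r
out <- data.frame(group = rep("A", 20), var = rep("B", 20),
                  stringsAsFactors = FALSE)
out$row_num <- seq_len(nrow(out))

# TRUE for every row that is not the first row of a block of five.
blank_rows <- out$row_num %% 5 != 1

# One vectorized assignment replaces the whole for loop.
out$group[blank_rows] <- ""
print(out$group)
```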

How to Apply String Vector to Logical Vector

I would like to replace each TRUE in a logical vector with the corresponding element of a string vector of the same length, and each FALSE with an empty string.
For example, I would like to combine:
my_logical <- c(TRUE, FALSE, TRUE)
my_string <- c("A", "B", "C")
to produce:
c("A", "", "C")
I know that:
my_string[my_logical]
gives:
"A" "C"
but can't seem to figure out how to return a vector of the same length. My first thought was to simply multiply the vectors together, but that raises the error "non-numeric argument to binary operator."

Another option with replace
replace(my_string, !my_logical, "")
#[1] "A" "" "C"

What about:
my_logical <- c(TRUE, FALSE, TRUE)
my_string <- c("A", "B", "C")
my_replace <- ifelse(my_logical==TRUE,my_string,'')
my_replace
[1] "A" "" "C"
Edit, thanks @www:
ifelse(my_logical, my_string, "")

Maybe:
my_string[ !my_logical ] <- ""
my_string
# [1] "A" "" "C"
Of course this overwrites existing object.

Use ifelse to build an index that is NA where my_logical is FALSE and TRUE otherwise, then subset with it.
new <- my_string[ifelse(!my_logical, NA, TRUE)]
new
[1] "A" NA "C"
If you want "" instead of NA, do this next.
new[is.na(new)] <- ""
new
[1] "A" "" "C"
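As a quick check, the replace(), ifelse(), and subassignment approaches should give the same vector for this input. The variable names via_replace, via_ifelse, and via_assign are introduced here for illustration only.

```r
my_logical <- c(TRUE, FALSE, TRUE)
my_string <- c("A", "B", "C")

# Three ways to blank the positions where my_logical is FALSE.
via_replace <- replace(my_string, !my_logical, "")
via_ifelse <- ifelse(my_logical, my_string, "")
via_assign <- my_string           # work on a copy so my_string is untouched
via_assign[!my_logical] <- ""

print(via_replace)
print(identical(via_replace, via_ifelse) && identical(via_replace, via_assign))
```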

Delete pattern in string and semicolon before and/or after (R)

In genomics, we often have to work with many strings of gene names that are separated by semicolons. I want to do pattern matching (find a specific gene name in a string), and then remove that from the string. I also need to remove any semicolon before or after the gene name. This toy example illustrates the problem.
s <- c("a;b;x", "a;x;b", "x;b", "x")
library(stringr)
str_replace(s, "x", "")
#[1] "a;b;" "a;;b" ";b" ""
The desired output should be.
#[1] "a;b" "a;b" "b" ""
I could do pattern matching for ;x and x; as well and that would give me the output; but that wouldn't be very efficient. We can also use gsub or the stringi package and that would be fine as well.

If x is at the start of the string, remove x and an optional ; after it; otherwise, remove x and an optional ; before it. This covers all the cases listed:
str_replace(s, "^x(;?)|(;?)x", "")
# [1] "a;b" "a;b" "b" ""

We can use gsub from base R
gsub("^x;|;?x", "", s)
#[1] "a;b" "a;b" "b" ""
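One caveat: the patterns above match x anywhere, so they would also strip the x from a longer name such as "ax". A minimal sketch of a token-based alternative splits on ";", drops the tokens equal to the gene, and pastes the rest back together. remove_gene() is a hypothetical helper name, not a function from any package, and the "ax;b" element is added here only to show the difference.

```r
s <- c("a;b;x", "a;x;b", "x;b", "x", "ax;b")

# Hypothetical helper: remove whole tokens equal to `gene`.
remove_gene <- function(strings, gene) {
  tokens <- strsplit(strings, ";", fixed = TRUE)
  vapply(tokens,
         function(p) paste(p[p != gene], collapse = ";"),
         character(1))
}

cleaned <- remove_gene(s, "x")
print(cleaned)
```

Since whole tokens are compared, no regex escaping of the gene name is needed.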

How to use the strsplit function with a period

I would like to split the following string by its periods. I tried strsplit() with "." in the split argument, but did not get the result I want.
s <- "I.want.to.split"
strsplit(s, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
The output I want is to split s into 4 elements in a list, as follows.
[[1]]
[1] "I" "want" "to" "split"
What should I do?

When using a regular expression in the split argument of strsplit(), you've got to escape the . with \\., or put it in a character class, [.]. Otherwise . keeps its special regex meaning, "any single character".
s <- "I.want.to.split"
strsplit(s, "[.]")
# [[1]]
# [1] "I" "want" "to" "split"
But the more efficient method here is to use the fixed argument in strsplit(). Using this argument will bypass the regex engine and search for an exact match of ".".
strsplit(s, ".", fixed = TRUE)
# [[1]]
# [1] "I" "want" "to" "split"
And of course, you can see help(strsplit) for more.
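To see why the original call returned 15 empty strings, compare the number of pieces with the number of characters in s: with an unescaped ".", every character acts as a delimiter. A small sketch (the name pieces is introduced here for illustration):

```r
s <- "I.want.to.split"

# An unescaped "." matches any character, so each character is a delimiter.
pieces <- strsplit(s, ".")[[1]]
print(length(pieces) == nchar(s))

# Both of these split on a literal period instead.
by_class <- strsplit(s, "[.]")[[1]]
by_fixed <- strsplit(s, ".", fixed = TRUE)[[1]]
print(identical(by_class, by_fixed))
```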

You need to either place the dot . inside of a character class or precede it with two backslashes to escape it since the dot is a character of special meaning in regex meaning "match any single character (except newline)"
s <- 'I.want.to.split'
strsplit(s, '\\.')
# [[1]]
# [1] "I" "want" "to" "split"

Besides strsplit(), you can also use scan(). Try:
scan(what = "", text = s, sep = ".")
# Read 4 items
# [1] "I" "want" "to" "split"
