Remove specific words and symbol from a df - r

I have a data frame structured like this, with 39 rows:
text.
"A" OR "B" OR "C"
"C" OR "D" OR "E"
I also have a "black list" of about 200 words that I want to delete; each one begins and ends with the symbol ". Here is an example:
blackList
"A"
"D"
I want to remove them from the starting data frame, obtaining:
text.
OR "B" OR "C"
"C" OR OR "E"
How can I do this? I tried removeWords, but it does not handle the symbol ".

We could build a single regex pattern by pasting all the blacklisted items together, using "|" as the collapse argument, and then remove every match with gsub.
df$text <- gsub(paste0(blacklist$blackList, collapse = "|"), "", df$text)
df
# text
#1 OR "B" OR "C"
#2 "C" OR OR "E"
data
df <- data.frame(text = c('"A" OR "B" OR "C"','"C" OR "D" OR "E"'))
blacklist <- data.frame(blackList = c('"A"', '"D"'))
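The pasted pattern above treats every blacklist entry as a regular expression, which is fine for entries like "A" but could misbehave if an entry contained characters such as . or (. As a minimal sketch, assuming the entries should be matched literally, the loop below removes each one with fixed = TRUE and then tidies the leftover spaces. The helper name remove_fixed is hypothetical, not part of any package.

```r
df <- data.frame(text = c('"A" OR "B" OR "C"', '"C" OR "D" OR "E"'))
blacklist <- data.frame(blackList = c('"A"', '"D"'))

# Hypothetical helper: remove each word as a literal string, not as a regex
remove_fixed <- function(text, words) {
  for (w in words) {
    text <- gsub(w, "", text, fixed = TRUE)
  }
  text
}

cleaned <- remove_fixed(df$text, blacklist$blackList)

# Collapse runs of spaces left behind and trim the ends
cleaned <- trimws(gsub(" +", " ", cleaned))
cleaned
# [1] "OR \"B\" OR \"C\""  "\"C\" OR OR \"E\""
```

With 200 entries the loop makes 200 passes over the column, which should be negligible for 39 rows.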

gsub('\"A\"', "", '"A" OR "B" OR "C"')
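Escape the quotes with a backslash and use gsub. The backslashes are only required when the pattern is written inside double quotes; inside single quotes, as here, '"A"' works as well.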

Related

R: strsplit on negative lookaround

Say I need to strsplit caabacb into individual letters, except when a letter is followed by a b, thus resulting in "c" "a" "ab" "a" "cb". I tried the following line, which looks OK on a regex tester but does not work in R. What did I do wrong?
strsplit('caabacb','(?!b)',perl=TRUE)
[[1]]
[1] "c" "a" "a" "b" "a" "c" "b"
You could prefix the pattern with a positive lookbehind that matches any character, (?<=.). On its own, the lookbehind (?<=.) would split the string after every character (without removing any characters), while the negative lookahead (?!b) excludes the split positions that are followed by a b:
strsplit('caabacb', '(?<=.)(?!b)', perl = TRUE)
#> [[1]]
#> [1] "c" "a" "ab" "a" "cb"
strsplit() probably needs an actual character to split on. You could first insert a separator such as ";" at every split position with gsub(), and then split on that separator.
strsplit(gsub("(?!^.|b|\\b)", ";", "caabacb", perl=TRUE), ";", perl=TRUE)
# [[1]]
# [1] "c" "a" "ab" "a" "cb"
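A different angle, sketched here as an alternative rather than a fix for the lookaround: instead of describing where to split, describe what each piece looks like and extract the matches. The pattern ".b*" (one character followed by any number of b's) is my own choice for this input, and no perl = TRUE is needed.

```r
x <- "caabacb"

# Match each piece directly: any single character plus the b's that follow it
pieces <- regmatches(x, gregexpr(".b*", x))[[1]]
pieces
# [1] "c"  "a"  "ab" "a"  "cb"
```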

Trim text after character for every item in list - R

I am trying to remove the text before and including a character ("-") for every element in a list.
For example:
x = list(c("a-b","b-c","c-d"),c("a-b","e-f"))
desired output:
"b" "c" "d"
"b" "f"
I have tried using various combinations of lapply and gsub, such as
lapply(x,gsub,'.*-','',x)
but this just returns a list of empty strings:
[[1]]
[1] ""
[[2]]
[1] ""
And only using
gsub(".*-","",x)
returns
"d\")" "f\")"
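You are close. lapply passes each list element as the first argument of gsub, which is pattern, so your positional arguments end up in the wrong slots. If you name pattern and replacement explicitly, the list element is matched to the remaining argument, x.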
x <- list(c("a-b","b-c","c-d"),c("a-b","e-f"))
lapply(x, gsub, pattern = "^.*-", replacement = "")
[[1]]
[1] "b" "c" "d"
[[2]]
[1] "b" "f"
This can be done with a for loop.
val <- vector("list", length(x))
for (i in seq_along(x)) {
  val[[i]] <- gsub(".*-", "", x[[i]])
}
val
[[1]]
[1] "b" "c" "d"
[[2]]
[1] "b" "f"
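One detail the answers above do not spell out is that ".*-" is greedy, so it removes everything up to the last hyphen. As a small sketch, assuming some elements could contain more than one hyphen (the example data does not), the pattern "^[^-]*-" with sub stops at the first hyphen instead. The function name strip_prefix and the input "a-b-c" are illustrative only.

```r
x <- list(c("a-b", "b-c", "c-d"), c("a-b", "e-f"))

# Hypothetical helper: drop everything up to and including the FIRST hyphen
strip_prefix <- function(v) sub("^[^-]*-", "", v)

res <- lapply(x, strip_prefix)
res
# [[1]]
# [1] "b" "c" "d"
# [[2]]
# [1] "b" "f"

# The two patterns only differ when an element has several hyphens
up_to_last  <- gsub(".*-", "", "a-b-c")     # "c"
up_to_first <- sub("^[^-]*-", "", "a-b-c")  # "b-c"
```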

Extract substring after the final dot

I want to implement a regex to extract the substring after the final dot.
For example,
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
gsub(".*(\\..*)$", "\\1", a)
The code returns
".d" ".e" "c" ".e" ".z"
How do I modify the code to get
"d" "e" "" "e" "z"
That is, if the string contains a dot, return the part after the final dot, without the dot itself; if the string doesn't contain a dot, return "".
Here is a way to do this using sub without capture groups. We can try replacing all content up to and including the final dot with empty string.
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
sub(".*\\.", "", a)
[1] "d" "e" "c" "e" "z"
If you want to return empty string should the input have no dot, then we can use ifelse with grepl:
input <- "Hello World!"
output <- ifelse(grepl("\\.", input), sub(".*\\.", "", input), "")
The code above is verbose because sub, by default, returns the original string when no match is found. In your case, you want different behavior.
Move the \\. outside the capture group, since you don't want the dot in the result:
sub(".*\\.(.*)", "\\1", a)
#[1] "d" "e" "c" "e" "z"
This will capture everything after the last dot.
For strings where we have no dots, we could check for it using grepl and then extract
ifelse(grepl("\\.", a), sub(".*\\.(.*)", "\\1", a), "")
#[1] "d" "e" "" "e" "z"
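If the ifelse/grepl combination feels heavy, one possible alternative is a single sub call with two alternatives: either everything up to the final dot, or a whole string that contains no dot at all. This is a sketch of my own, not taken from the answers above; the variable name after_last_dot is arbitrary.

```r
a <- c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")

# Alternative 1: "^.*\\."   removes everything through the last dot
# Alternative 2: "^[^.]*$"  removes the whole string when it has no dot
after_last_dot <- sub("^.*\\.|^[^.]*$", "", a)
after_last_dot
# [1] "d" "e" ""  "e" "z"
```

The two alternatives cannot both apply to the same string, since a string either contains a dot or it does not.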

Why does asterisk wildcard fail with sub() command? [r]

When using the sub() function in R, how do we use an asterisk wildcard to replace all characters after (or before) an indicator?
If we want to remove an underscore and all arbitrary text afterward:
x <- c("a_101", "a_275", "b_133", "b_277")
The following code removes nothing:
sub(pattern = "_*", replacement = "", x = x)
[1] "a_101" "a_275" "b_133" "b_277"
Desired output:
"a" "a" "b" "b"
Why does the wildcard fail?
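In a regular expression, * is a quantifier rather than a wildcard: "_*" means "zero or more underscores", which matches the empty string at the start of each element, so sub replaces nothing. If using sub, you have to specify everything you want to replace, and what you want to replace it with. Here I've done that using a capture group for the letter of interest.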
sub('([a-z])_\\d+', replacement = '\\1', x)
[1] "a" "a" "b" "b"
Using the wild card will work too.
sub('([a-z])_.*', replacement = '\\1', x)
[1] "a" "a" "b" "b"
And finally more along the lines of what you were thinking:
sub('_.*', replacement = "", x)
[1] "a" "a" "b" "b"
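To see what "_*" actually matched in the original attempt, regexpr can report the position and length of the first match. The sketch below is only a diagnostic on the question's data; the variable names are arbitrary.

```r
x <- c("a_101", "a_275", "b_133", "b_277")

# Where does "_*" match first, and how long is that match?
m <- regexpr("_*", x)
match_start  <- as.integer(m)
match_length <- attr(m, "match.length")
match_start   # 1 1 1 1: the match is at the very first position
match_length  # 0 0 0 0: and it is empty, so sub() has nothing to remove

# "." means "any character", so "_.*" is underscore plus everything after it
fixed_result <- sub("_.*", "", x)
fixed_result
# [1] "a" "a" "b" "b"
```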

Convert string into a vector of letters in R

Suppose I have a string in R such as "aa1122ddccdsadsa".
I want to convert any string into a vector of its individual characters. How can I do that?
That is, given a string, I want the result to be
"a" "a" "1" "1" "2" etc.
There are actually lots of ways to do this. Here's one using strsplit and regex:
x <- c("aa1122ddccdsadsa")
strsplit(gsub("([[:alnum:]]{1})", "\\1 ", x), " ")[[1]]
[1] "a" "a" "1" "1" "2" "2" "d" "d" "c" "c" "d" "s" "a" "d" "s" "a"
You could also use substring or plyr.
Another option is the stri_sub function from the stringi package.
It extracts substrings of length 1 starting at positions 1, 2, 3, ... of the string:
require(stringi)
x <- "alamakota"
stri_sub(x,from=1:stri_length(x),length = 1)
## [1] "a" "l" "a" "m" "a" "k" "o" "t" "a"
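For completeness, the same result is available in base R without a regex or an extra package. The sketch below shows two ways on the question's string: splitting on the empty string, and the substring approach mentioned above. The variable names are arbitrary.

```r
x <- "aa1122ddccdsadsa"

# Splitting on "" breaks the string into single characters
chars_split <- strsplit(x, "")[[1]]

# substring() with identical start and stop positions does the same
idx <- seq_len(nchar(x))
chars_sub <- substring(x, idx, idx)

chars_split
# [1] "a" "a" "1" "1" "2" "2" "d" "d" "c" "c" "d" "s" "a" "d" "s" "a"
identical(chars_split, chars_sub)
# [1] TRUE
```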
