Why does asterisk wildcard fail with sub() command? [r]

When using the sub() function in R, how do we use an asterisk wildcard to replace all characters after (or before) an indicator?
If we want to remove an underscore and all arbitrary text afterward:
x <- c("a_101", "a_275", "b_133", "b_277")
The following code removes nothing:
sub(pattern = "_*", replacement = "", x = x)
[1] "a_101" "a_275" "b_133" "b_277"
Desired output:
"a" "a" "b" "b"
Why does the wildcard fail?

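In a regular expression, * is not a stand-alone wildcard: it is a quantifier meaning "zero or more of the preceding item". The pattern "_*" therefore matches zero underscores at the very start of each string, and sub() replaces that empty match with nothing. With sub, you have to specify everything you want to replace and what you want to replace it with. Here I've done that using a capture group for the letter of interest.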
sub('([a-z])_\\d+', replacement = '\\1', x)
[1] "a" "a" "b" "b"
Using the wildcard .* (any character, zero or more times) will work too.
sub('([a-z])_.*', replacement = '\\1', x)
[1] "a" "a" "b" "b"
And finally more along the lines of what you were thinking:
sub('_.*', replacement = "", x)
[1] "a" "a" "b" "b"
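To see why the first attempt removed nothing, it can help to inspect where each pattern matches. This is a small sketch using base R's regexpr(); the variable names m_star and m_dot are only for illustration.

```r
x <- c("a_101", "a_275", "b_133", "b_277")

# "_*" means "zero or more underscores", so it matches the empty
# string at position 1 of every element.
m_star <- regexpr("_*", x)
as.integer(m_star)             # 1 1 1 1
attr(m_star, "match.length")   # 0 0 0 0

# "_.*" means "an underscore followed by anything", so it matches
# from the underscore to the end of the string.
m_dot <- regexpr("_.*", x)
as.integer(m_dot)              # 2 2 2 2
attr(m_dot, "match.length")    # 4 4 4 4
```

Because sub() replaces only the first match, the zero-length match at position 1 is what gets replaced by "", leaving the string unchanged.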

Related

R: strsplit on negative lookaround

Say I need to strsplit caabacb into individual letters except when a letter is followed by a b, thus resulting in "c" "a" "ab" "a" "cb". I tried using the following line, which looks OK on regex tester but does not work in R. What did I do wrong?
strsplit('caabacb','(?!b)',perl=TRUE)
[[1]]
[1] "c" "a" "a" "b" "a" "c" "b"
You could also add a prefix positive lookbehind that matches any character (?<=.). The positive lookbehind (?<=.) would split the string at every character (without removal of characters), but the negative lookahead (?!b) excludes splits where a character is followed by a b:
strsplit('caabacb', '(?<=.)(?!b)', perl = TRUE)
#> [[1]]
#> [1] "c" "a" "ab" "a" "cb"
strsplit() probably needs actual characters to split on. You could insert a delimiter, e.g. a ";", with gsub() and then split on it.
strsplit(gsub("(?!^.|b|\\b)", ";", "caabacb", perl=TRUE), ";", perl=TRUE)
# [[1]]
# [1] "c" "a" "ab" "a" "cb"
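Another option, not taken from the answers above, is to match the tokens instead of splitting between them. The sketch below assumes a token is "any single character, optionally followed by one b" and uses base R's gregexpr() with regmatches(); it may behave differently from the lookaround versions on inputs such as a string that starts with b or contains consecutive b's.

```r
s <- "caabacb"

# ".b?" matches one character plus an immediately following "b", if any.
tokens <- regmatches(s, gregexpr(".b?", s))[[1]]
tokens
# [1] "c"  "a"  "ab" "a"  "cb"
```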

Remove specific words and symbol from a df

I have a data frame structured like this (39 rows):
text.
"A" OR "B" OR "C"
"C" OR "D" OR "E"
and a "black list" of 200 words that I want to delete, each of which begins and ends with the symbol ". Here is an example:
blackList
"A"
"D"
I want to remove them from the starting data frame, obtaining:
text.
OR "B" OR "C"
"C" OR OR "E"
How can I do this? I tried removeWords, but it does not handle the symbol ".
We could create a pattern by pasting all the blacklisted items together, using "|" as the collapse argument, and then remove every match with gsub.
df$text <- gsub(paste0(blacklist$blackList, collapse = "|"), "", df$text)
df
# text
#1 OR "B" OR "C"
#2 "C" OR OR "E"
data
df <- data.frame(text = c('"A" OR "B" OR "C"','"C" OR "D" OR "E"'))
blacklist <- data.frame(blackList = c('"A"', '"D"'))
For a single word, escape the quotes with a backslash and use gsub:
gsub('\"A\"', "", '"A" OR "B" OR "C"')
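If some blacklisted words might contain regex metacharacters (for example a dot or parentheses), the "|" pattern could match more than intended. A hedged alternative is to remove each word literally with fixed = TRUE. This sketch reuses the example data; the variable cleaned is introduced here only for illustration.

```r
df <- data.frame(text = c('"A" OR "B" OR "C"', '"C" OR "D" OR "E"'),
                 stringsAsFactors = FALSE)
blacklist <- data.frame(blackList = c('"A"', '"D"'),
                        stringsAsFactors = FALSE)

cleaned <- df$text
for (word in blacklist$blackList) {
  # fixed = TRUE treats `word` as a literal string, not a regex
  cleaned <- gsub(word, "", cleaned, fixed = TRUE)
}
cleaned
```

The surrounding spaces are left in place, so the second element keeps two spaces between the remaining OR tokens.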

Extract substring after the final dot

I want to implement a regex to extract the substring after the final dot.
For example,
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
gsub(".*(\\..*)$", "\\1", a)
The code returns
".d" ".e" "c" ".e" ".z"
How do I modify the code to get
"d" "e" "" "e" "z"
That is to say, if the string contains a dot, it should return the part after the last dot, without the dot itself; if the string doesn't contain a dot, it should return "".
Here is a way to do this using sub without capture groups. We can try replacing all content up to and including the final dot with empty string.
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
sub(".*\\.", "", a)
[1] "d" "e" "c" "e" "z"
If you want to return empty string should the input have no dot, then we can use ifelse with grepl:
input <- "Hello World!"
output <- ifelse(grepl("\\.", input), sub(".*\\.", "", input), "")
The code above is verbose because sub, by default, returns the original string when no match is found. In your case, you want different behavior.
You need the \\. outside the capture group, since you don't want the dot in the result:
sub(".*\\.(.*)", "\\1", a)
#[1] "d" "e" "c" "e" "z"
This will capture everything after the last dot.
For strings where we have no dots, we could check for it using grepl and then extract
ifelse(grepl("\\.", a), sub(".*\\.(.*)", "\\1", a), "")
#[1] "d" "e" "" "e" "z"
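The grepl/sub combination can be wrapped in a small helper so it is easy to reuse. This is only a sketch; the function name after_last_dot is made up for illustration.

```r
# Return the text after the final dot, or "" when there is no dot.
after_last_dot <- function(s) {
  has_dot <- grepl(".", s, fixed = TRUE)      # literal dot, no regex
  ifelse(has_dot, sub(".*\\.", "", s), "")    # greedy .* runs to the last dot
}

a <- c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
after_last_dot(a)
# [1] "d" "e" ""  "e" "z"
```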

In R, how can a string be split without using a separator

I am trying the split method, and I want to get the second element of a string that contains only two elements. The length of the string is 2.
Example:
string= "AC"
The result should be a split after the first letter ("A"), so that I get:
res=
     [,1] [,2]
[1,] "A"  "C"
I tried it with split, but I have no idea how to split after the first element.
strsplit() will do what you want (if I understand your question). You need to split on "" to split the string into its elements. Here is an example showing how to do what you want on a vector of strings:
strs <- rep("AC", 3) ## your string repeated 3 times
next, split each of the three strings
sstrs <- strsplit(strs, "")
which produces
> sstrs
[[1]]
[1] "A" "C"
[[2]]
[1] "A" "C"
[[3]]
[1] "A" "C"
This is a list, so we can process it with lapply() or sapply(). We need to subset each element of sstrs to select the second element. For this we apply the [ function:
sapply(sstrs, `[`, 2)
which produces:
> sapply(sstrs, `[`, 2)
[1] "C" "C" "C"
If all you have is one string, then
strsplit("AC", "")[[1]][2]
which gives:
> strsplit("AC", "")[[1]][2]
[1] "C"
split isn't used for this kind of string manipulation. What you're looking for is strsplit, which in your case would be used something like this:
strsplit(string,"",fixed = TRUE)
You may not need fixed = TRUE, but it's a habit of mine as I tend to avoid regular expressions. You seem to indicate that you want the result to be something like a matrix. strsplit will return a list, so you'll want something like this:
strsplit(string,"",fixed = TRUE)[[1]]
and then pass the result to matrix.
If you are sure that it's always a two-character string (check with all(nchar(x)==2)) and you want only the second character, you could use sub or substr:
x <- c("ab", "12")
sub(".", "", x)
# [1] "b" "2"
substr(x, 2, 2)
# [1] "b" "2"
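Since the question asks for a matrix-like result, one possible way to get it is to row-bind the pieces returned by strsplit(). This sketch assumes every string has the same number of characters; the input vector strs and the name m are illustrative.

```r
strs <- c("AC", "BD", "EF")

# strsplit() returns a list with one character vector per string;
# rbind stacks those vectors as the rows of a character matrix.
parts <- strsplit(strs, "")
m <- do.call(rbind, parts)

m[, 2]   # second character of each string
# [1] "C" "D" "F"
```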

Substring first character from right

I want to be able to substring the first character from the right hand side of each element of a vector
ABC20
BCD3
B1
AB2222
BX4444
So for the group above I would want C, D, B, B, X. Is there an easy way to do this? I know there is substr and a numindex/charindex, so I think I can use these, but I'm not sure exactly how in R.
You can use the stringi library:
stringi::stri_extract_last_regex(x, '[A-Z]')
#[1] "C" "D" "B" "B" "X"
DATA
x <- c('ABC20', 'BCD3', 'B1', 'AB2222', 'BX4444')
Try this:
Your data:
codes <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")
Identify the position of the first digit
number_pos <- gregexpr(pattern = "[0-9]", codes)
number_first <- unlist(lapply(number_pos, `[[`, 1))
Extract the character just before it
substr(codes, number_first - 1, number_first - 1)
[1] "C" "D" "B" "B" "X"
We can use sub to capture the last upper case letter (([A-Z])) followed by zero or more digits (\\d*) until the end ($) of the string and replace it with the backreference (\\1) of the captured group
sub(".*([A-Z])\\d*$", "\\1", x)
#[1] "C" "D" "B" "B" "X"
data
x <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")
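For completeness, a base R counterpart to the stringi call can be sketched with regexpr() and regmatches(). It assumes, as in the example data, that every element contains at least one uppercase letter; elements without a match would simply be dropped from the result.

```r
x <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")

# Match an uppercase letter that has no further uppercase letters after it,
# using a lookahead (hence perl = TRUE).
last_letter <- regmatches(x, regexpr("[A-Z](?=[^A-Z]*$)", x, perl = TRUE))
last_letter
# [1] "C" "D" "B" "B" "X"
```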
